2025-05-07T20:23:26.1393090Z Current runner version: '2.323.0'
2025-05-07T20:23:26.1398843Z Runner name: 'i-061cb0426579ace80'
2025-05-07T20:23:26.1399775Z Machine name: 'ip-10-0-29-91'
2025-05-07T20:23:26.1402490Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.1404928Z Contents: read
2025-05-07T20:23:26.1405441Z Metadata: read
2025-05-07T20:23:26.1405919Z Packages: read
2025-05-07T20:23:26.1406410Z ##[endgroup]
2025-05-07T20:23:26.1408571Z Secret source: None
2025-05-07T20:23:26.1409212Z Prepare workflow directory
2025-05-07T20:23:26.2313191Z Prepare all required actions
2025-05-07T20:23:26.2352745Z Getting action download info
2025-05-07T20:23:26.4733888Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.7676259Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:27.1265939Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.7842033Z Getting action download info
2025-05-07T20:23:29.0872873Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:29.3205637Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.8.0, 12.6.3, clang)
2025-05-07T20:23:29.3700753Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:29.3813101Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:29.3824393Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:29.3825026Z ##[endgroup]
2025-05-07T20:23:30.5860307Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.5860727Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.5860969Z AMI Name: unknown
2025-05-07T20:23:30.5898217Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.9472564Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.9472863Z with:
2025-05-07T20:23:35.9473082Z   submodules: true
2025-05-07T20:23:35.9473313Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.9473688Z   token: ***
2025-05-07T20:23:35.9473888Z   ssh-strict: true
2025-05-07T20:23:35.9474085Z   ssh-user: git
2025-05-07T20:23:35.9474303Z   persist-credentials: true
2025-05-07T20:23:35.9474541Z   clean: true
2025-05-07T20:23:35.9474767Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.9475024Z   fetch-depth: 1
2025-05-07T20:23:35.9475230Z   fetch-tags: false
2025-05-07T20:23:35.9475442Z   show-progress: true
2025-05-07T20:23:35.9475653Z   lfs: false
2025-05-07T20:23:35.9475859Z   set-safe-directory: true
2025-05-07T20:23:35.9476090Z env:
2025-05-07T20:23:35.9476292Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.9476571Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.9476807Z   BUILD_TARGET: genai
2025-05-07T20:23:35.9477036Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.9477294Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:35.9477538Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.9477766Z ##[endgroup]
2025-05-07T20:23:36.0638303Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:36.0639566Z ##[group]Getting Git version info
2025-05-07T20:23:36.0639989Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:36.0641066Z [command]/usr/bin/git version
2025-05-07T20:23:36.0641500Z git version 2.47.1
2025-05-07T20:23:36.0661436Z ##[endgroup]
2025-05-07T20:23:36.0676021Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e68163b3-6f02-4d62-afcd-52334a42cb06' before making global git config changes
2025-05-07T20:23:36.0677121Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:36.0690657Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:36.0730093Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:36.0753412Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:36.0771258Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:36.0776099Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:36.0801871Z refs/heads/main
2025-05-07T20:23:36.0811771Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.9428485Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.9479771Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.9507483Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.9514645Z ##[endgroup]
2025-05-07T20:23:36.9517637Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.9941483Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:37.0028272Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:37.0114658Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:37.0199386Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:37.0283483Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:37.0369628Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:37.0452151Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:37.0465139Z ##[group]Cleaning the repository
2025-05-07T20:23:37.0469689Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:37.0525837Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:37.0631639Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.0638990Z ##[endgroup]
2025-05-07T20:23:37.0641268Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:37.0645669Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:37.0676801Z ##[endgroup]
2025-05-07T20:23:37.0677400Z ##[group]Setting up auth
2025-05-07T20:23:37.0682835Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:37.0724812Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:37.1055540Z Entering 'external/asmjit'
2025-05-07T20:23:37.1122068Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.1195102Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.1262514Z Entering 'external/cutlass'
2025-05-07T20:23:37.1336212Z Entering 'external/googletest'
2025-05-07T20:23:37.1402067Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.1468286Z Entering 'external/json'
2025-05-07T20:23:37.1552620Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:37.1584992Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:37.1912719Z Entering 'external/asmjit'
2025-05-07T20:23:37.1979101Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.2052990Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.2118974Z Entering 'external/cutlass'
2025-05-07T20:23:37.2193415Z Entering 'external/googletest'
2025-05-07T20:23:37.2260439Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.2326578Z Entering 'external/json'
2025-05-07T20:23:37.2415170Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.2465937Z ##[endgroup]
2025-05-07T20:23:37.2466424Z ##[group]Fetching the repository
2025-05-07T20:23:37.2473295Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:37.4642356Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:37.4643148Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:37.4668518Z ##[endgroup]
2025-05-07T20:23:37.4669044Z ##[group]Determining the checkout info
2025-05-07T20:23:37.4670359Z ##[endgroup]
2025-05-07T20:23:37.4674214Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:37.4724238Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:37.4753968Z ##[group]Checking out the ref
2025-05-07T20:23:37.4757390Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:37.4879730Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.4882997Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:37.4892660Z ##[endgroup]
2025-05-07T20:23:37.4893249Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:37.4897907Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.4944940Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:37.4976416Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:37.5007743Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:37.5035805Z ##[endgroup]
2025-05-07T20:23:37.5036306Z ##[group]Fetching submodules
2025-05-07T20:23:37.5039123Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.5412980Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.5413501Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.5414206Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.5414853Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.5416599Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.5417198Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.5417676Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.5429053Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.5860347Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.6008249Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.6108602Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.6276279Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.6365916Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.6448121Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.6549631Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.6566793Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.6898521Z Entering 'external/asmjit'
2025-05-07T20:23:37.6931126Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6962578Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6995872Z Entering 'external/cutlass'
2025-05-07T20:23:37.7027967Z Entering 'external/googletest'
2025-05-07T20:23:37.7061044Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7093035Z Entering 'external/json'
2025-05-07T20:23:37.7137812Z ##[endgroup]
2025-05-07T20:23:37.7138649Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.7145939Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.7474839Z Entering 'external/asmjit'
2025-05-07T20:23:37.7516736Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7517473Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7559646Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.7601641Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7602028Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7650414Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.7692391Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7692922Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7734065Z Entering 'external/cutlass'
2025-05-07T20:23:37.7776084Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7776590Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7826419Z Entering 'external/googletest'
2025-05-07T20:23:37.7868838Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7869294Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7910986Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7970204Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7970641Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7995173Z Entering 'external/json'
2025-05-07T20:23:37.8037010Z url.https://github.com/.insteadof
2025-05-07T20:23:37.8037397Z url.https://github.com/.insteadof
2025-05-07T20:23:37.8098627Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.8430277Z Entering 'external/asmjit'
2025-05-07T20:23:37.8492611Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.8495272Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.8556198Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.8559234Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.8619909Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.8622854Z Entering 'external/cutlass'
2025-05-07T20:23:37.8684581Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.8687837Z Entering 'external/googletest'
2025-05-07T20:23:37.8750054Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.8753103Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.8814120Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.8816982Z Entering 'external/json'
2025-05-07T20:23:37.8878246Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.8993654Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.9325257Z Entering 'external/asmjit'
2025-05-07T20:23:37.9358433Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.9389912Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.9421395Z Entering 'external/cutlass'
2025-05-07T20:23:37.9454188Z Entering 'external/googletest'
2025-05-07T20:23:37.9485316Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.9516642Z Entering 'external/json'
2025-05-07T20:23:37.9564675Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.9892149Z Entering 'external/asmjit'
2025-05-07T20:23:37.9924847Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.9956745Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.9988287Z Entering 'external/cutlass'
2025-05-07T20:23:38.0021572Z Entering 'external/googletest'
2025-05-07T20:23:38.0056356Z Entering 'external/hipify_torch'
2025-05-07T20:23:38.0088407Z Entering 'external/json'
2025-05-07T20:23:38.0137860Z ##[endgroup]
2025-05-07T20:23:38.0179552Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:38.0206742Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
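A note on the auth dance above: the checkout action points SSH-style submodule URLs back at HTTPS with url.<base>.insteadOf, then injects the token via http.https://github.com/.extraheader. A minimal sketch of the rewrite mechanism, run in a throwaway HOME so no real config is touched (the ls-remote target is only an example):

    export HOME="$(mktemp -d)"   # isolate the global git config used below
    git config --global 'url.https://github.com/.insteadOf' 'git@github.com:'
    # git applies the rewrite before contacting the remote, so this SSH-style
    # URL is actually fetched over HTTPS:
    git ls-remote git@github.com:pytorch/FBGEMM HEAD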
2025-05-07T20:23:38.0379655Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:38.0379960Z with:
2025-05-07T20:23:38.0380202Z   name: fbgemm_genai_x86_clang_py3.13_cu12.8.0.whl
2025-05-07T20:23:38.0380519Z   merge-multiple: false
2025-05-07T20:23:38.0380764Z   repository: pytorch/FBGEMM
2025-05-07T20:23:38.0381017Z   run-id: 14891846252
2025-05-07T20:23:38.0381225Z env:
2025-05-07T20:23:38.0381444Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.0381735Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.0381977Z   BUILD_TARGET: genai
2025-05-07T20:23:38.0382192Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.0382430Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.0382671Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.0382900Z ##[endgroup]
2025-05-07T20:23:38.2673056Z Downloading single artifact
2025-05-07T20:23:38.3922073Z Preparing to download the following artifacts:
2025-05-07T20:23:38.3922879Z - fbgemm_genai_x86_clang_py3.13_cu12.8.0.whl (ID: 3081408483, Size: 18517235, Expected Digest: sha256:2c430e283306050771ed0148f8bc0ff9c88d696c9122c4b4956d4418e1e568bd)
2025-05-07T20:23:38.4520988Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-569411a8-0277-5d3b-912a-bdc2bb6543f6/artifacts/b2a3a6e3a6de2b82b0a644dc87c7372954cdbe64040c2d38887481b8860a6fb7.zip
2025-05-07T20:23:38.4522380Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.5742479Z (node:58325) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.5743436Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.8551631Z SHA256 digest of downloaded artifact is 2c430e283306050771ed0148f8bc0ff9c88d696c9122c4b4956d4418e1e568bd
2025-05-07T20:23:38.8552188Z Artifact download completed successfully.
2025-05-07T20:23:38.8558237Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.8558550Z Download artifact has finished successfully
2025-05-07T20:23:38.8802249Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.8802630Z with:
2025-05-07T20:23:38.8802836Z   driver-version: 570.133.07
2025-05-07T20:23:38.8803071Z env:
2025-05-07T20:23:38.8803278Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.8803568Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.8803809Z   BUILD_TARGET: genai
2025-05-07T20:23:38.8804027Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.8804252Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.8804502Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.8804732Z ##[endgroup]
2025-05-07T20:23:38.8900311Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.8900680Z with:
2025-05-07T20:23:38.8900874Z   timeout_minutes: 10
2025-05-07T20:23:38.8901100Z   max_attempts: 3
2025-05-07T20:23:38.8923905Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install the nvidia-docker2 package if it is not already installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if the NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first
          # GPU if there is more than one, so that the same driver version is not
          # printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in the future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install the NVIDIA driver; try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again that nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the
            # GPU microcode. When this happens, we'll try to reset all NVIDIA devices
            # https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0
          # even in the case where the driver has already crashed, as it still can get the
          # driver version and some basic information like the bus ID. However, the rest
          # of the information would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like
          # the GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.8947221Z   retry_wait_seconds: 10
2025-05-07T20:23:38.8947468Z   polling_interval_seconds: 1
2025-05-07T20:23:38.8947764Z   warning_on_retry: true
2025-05-07T20:23:38.8948003Z   continue_on_error: false
2025-05-07T20:23:38.8948237Z env:
2025-05-07T20:23:38.8948449Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.8948741Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.8948990Z   BUILD_TARGET: genai
2025-05-07T20:23:38.8949206Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.8949435Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.8949677Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.8949911Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.8950147Z ##[endgroup]
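Worth pulling out of the script above: nvidia-smi exit statuses 0 and 14 are both treated as healthy (per the gpu-operator issue linked in the script), and anything else fails the step. A standalone sketch of that gate, assuming only that nvidia-smi is on PATH:

    set +e
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    NVIDIA_SMI_STATUS=$?
    set -e
    # 0 = success; 14 is an allowed status per
    # https://github.com/NVIDIA/gpu-operator/issues/285
    if [ "${NVIDIA_SMI_STATUS}" -eq 0 ] || [ "${NVIDIA_SMI_STATUS}" -eq 14 ]; then
      echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
    else
      echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" >&2
      exit "${NVIDIA_SMI_STATUS}"
    fi

Querying gpu_name (rather than trusting nvidia-smi's overall exit code) is what actually catches a crashed driver, since a crashed driver still reports its version but returns ERR! for device details.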
2025-05-07T20:23:38.9765844Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.9766864Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.9770763Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:39.3367889Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:39.3368243Z No packages marked for removal.
2025-05-07T20:23:39.3431611Z Dependencies resolved.
2025-05-07T20:23:39.3443249Z Nothing to do.
2025-05-07T20:23:39.3443726Z Complete!
2025-05-07T20:23:39.3781147Z + install_nvidia_driver_common
2025-05-07T20:23:39.3785471Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:39.3786616Z Before installing NVIDIA driver
2025-05-07T20:23:39.3788979Z + lspci
2025-05-07T20:23:39.3979544Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.3980622Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.3981188Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.3981689Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.3982155Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.3982665Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.3983125Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.3983586Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.3983977Z + lsmod
2025-05-07T20:23:39.4024359Z Module                  Size  Used by
2025-05-07T20:23:39.4025169Z xt_conntrack           16384  1
2025-05-07T20:23:39.4025974Z nft_chain_nat          16384  3
2025-05-07T20:23:39.4026631Z xt_MASQUERADE          20480  1
2025-05-07T20:23:39.4027032Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.4027442Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:39.4027918Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.4028344Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:39.4028648Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:39.4028932Z xfrm_user              57344  1
2025-05-07T20:23:39.4029192Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:39.4029472Z xt_addrtype            16384  2
2025-05-07T20:23:39.4029725Z nft_compat             20480  4
2025-05-07T20:23:39.4030011Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.4030409Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.4030765Z br_netfilter           36864  0
2025-05-07T20:23:39.4031243Z bridge                323584  1 br_netfilter
2025-05-07T20:23:39.4031532Z stp                    16384  1 bridge
2025-05-07T20:23:39.4031810Z llc                    16384  2 bridge,stp
2025-05-07T20:23:39.4032087Z overlay               167936  0
2025-05-07T20:23:39.4032323Z tls                   135168  0
2025-05-07T20:23:39.4032561Z nls_ascii              16384  1
2025-05-07T20:23:39.4032806Z nls_cp437              20480  1
2025-05-07T20:23:39.4033044Z vfat                   24576  1
2025-05-07T20:23:39.4033288Z fat                    86016  1 vfat
2025-05-07T20:23:39.4033543Z ena                   180224  0
2025-05-07T20:23:39.4033773Z sunrpc                696320  1
2025-05-07T20:23:39.4034014Z i8042                  45056  0
2025-05-07T20:23:39.4034260Z serio                  28672  3 i8042
2025-05-07T20:23:39.4034525Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:39.4034778Z button                 24576  0
2025-05-07T20:23:39.4035025Z sch_fq_codel           20480  17
2025-05-07T20:23:39.4035276Z dm_mod                188416  0
2025-05-07T20:23:39.4035518Z dax                    45056  1 dm_mod
2025-05-07T20:23:39.4035787Z loop                   36864  0
2025-05-07T20:23:39.4036020Z fuse                  163840  1
2025-05-07T20:23:39.4036266Z configfs               57344  1
2025-05-07T20:23:39.4036506Z dmi_sysfs              20480  0
2025-05-07T20:23:39.4036749Z crc32_pclmul           16384  0
2025-05-07T20:23:39.4036993Z crc32c_intel           24576  0
2025-05-07T20:23:39.4037284Z efivarfs               24576  1
2025-05-07T20:23:39.4037527Z + modinfo nvidia
2025-05-07T20:23:39.4044110Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.4044769Z import_ns:      DMA_BUF
2025-05-07T20:23:39.4045095Z alias:          char-major-195-*
2025-05-07T20:23:39.4045437Z version:        570.133.07
2025-05-07T20:23:39.4045675Z supported:      external
2025-05-07T20:23:39.4045917Z license:        Dual MIT/GPL
2025-05-07T20:23:39.4046205Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.4046555Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.4047065Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:39.4047384Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.4047806Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.4048150Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.4048451Z depends:        i2c-core,drm
2025-05-07T20:23:39.4048701Z retpoline:      Y
2025-05-07T20:23:39.4048905Z name:           nvidia
2025-05-07T20:23:39.4049253Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.4049706Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.4050141Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.4050536Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.4050832Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:39.4051127Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.4051430Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:39.4051725Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:39.4052018Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:39.4052365Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.4052757Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.4053079Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.4053366Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:39.4053653Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.4054001Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.4054385Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.4054744Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.4055144Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.4055681Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.4056085Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.4056505Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.4056916Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.4057276Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.4057628Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.4057959Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.4058269Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.4058581Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.4058936Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.4059275Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:39.4059608Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.4060012Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.4060387Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:39.4060712Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.4061044Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.4061372Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:39.4061708Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.4062031Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:39.4062305Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.4062695Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.4063045Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.4063347Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.4063671Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.4064011Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.4064346Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:39.4064666Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.4065009Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.4065341Z parm:           rm_firmware_active:charp
2025-05-07T20:23:39.4065736Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:39.4065975Z ++ command -v nvidia-smi
2025-05-07T20:23:39.4066228Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:39.4066472Z + set +e
2025-05-07T20:23:39.4066794Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:41.2269963Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:41.2270307Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:41.2270610Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:41.2270816Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:41.2271089Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:41.2271509Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:41.2271954Z + set -e
2025-05-07T20:23:41.2272141Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:41.2272516Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:41.2272997Z + post_install_nvidia_driver_common
2025-05-07T20:23:41.2276109Z + sudo modprobe nvidia
2025-05-07T20:23:41.3138084Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:41.3138402Z + lspci
2025-05-07T20:23:41.3138606Z After installing NVIDIA driver
2025-05-07T20:23:41.3257047Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:41.3257525Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:41.3258076Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:41.3258580Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:41.3259039Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:41.3259547Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.3260021Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:41.3260764Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:41.3261154Z + lsmod
2025-05-07T20:23:41.3288525Z Module                  Size  Used by
2025-05-07T20:23:41.3288824Z nvidia_uvm           1884160  0
2025-05-07T20:23:41.3289108Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:23:41.3289383Z drm                   602112  1 nvidia
2025-05-07T20:23:41.3289675Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:41.3289973Z backlight              24576  1 drm
2025-05-07T20:23:41.3290250Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:41.3290529Z xt_conntrack           16384  1
2025-05-07T20:23:41.3290786Z nft_chain_nat          16384  3
2025-05-07T20:23:41.3291036Z xt_MASQUERADE          20480  1
2025-05-07T20:23:41.3291317Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:41.3291636Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:41.3292015Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:41.3292431Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:41.3292737Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:41.3293022Z xfrm_user              57344  1
2025-05-07T20:23:41.3293273Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:41.3293552Z xt_addrtype            16384  2
2025-05-07T20:23:41.3293794Z nft_compat             20480  4
2025-05-07T20:23:41.3294083Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:41.3294470Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:41.3294829Z br_netfilter           36864  0
2025-05-07T20:23:41.3295098Z bridge                323584  1 br_netfilter
2025-05-07T20:23:41.3295373Z stp                    16384  1 bridge
2025-05-07T20:23:41.3295647Z llc                    16384  2 bridge,stp
2025-05-07T20:23:41.3295915Z overlay               167936  0
2025-05-07T20:23:41.3296150Z tls                   135168  0
2025-05-07T20:23:41.3296397Z nls_ascii              16384  1
2025-05-07T20:23:41.3296809Z nls_cp437              20480  1
2025-05-07T20:23:41.3297044Z vfat                   24576  1
2025-05-07T20:23:41.3297284Z fat                    86016  1 vfat
2025-05-07T20:23:41.3297534Z ena                   180224  0
2025-05-07T20:23:41.3297774Z sunrpc                696320  1
2025-05-07T20:23:41.3298006Z i8042                  45056  0
2025-05-07T20:23:41.3298251Z serio                  28672  3 i8042
2025-05-07T20:23:41.3298519Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:41.3298760Z button                 24576  0
2025-05-07T20:23:41.3299011Z sch_fq_codel           20480  17
2025-05-07T20:23:41.3299259Z dm_mod                188416  0
2025-05-07T20:23:41.3299493Z dax                    45056  1 dm_mod
2025-05-07T20:23:41.3299755Z loop                   36864  0
2025-05-07T20:23:41.3299993Z fuse                  163840  1
2025-05-07T20:23:41.3300226Z configfs               57344  1
2025-05-07T20:23:41.3300475Z dmi_sysfs              20480  0
2025-05-07T20:23:41.3300720Z crc32_pclmul           16384  0
2025-05-07T20:23:41.3300961Z crc32c_intel           24576  0
2025-05-07T20:23:41.3301204Z efivarfs               24576  1
2025-05-07T20:23:41.3301447Z + modinfo nvidia
2025-05-07T20:23:41.3305829Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:41.3306279Z import_ns:      DMA_BUF
2025-05-07T20:23:41.3306526Z alias:          char-major-195-*
2025-05-07T20:23:41.3306794Z version:        570.133.07
2025-05-07T20:23:41.3307030Z supported:      external
2025-05-07T20:23:41.3307274Z license:        Dual MIT/GPL
2025-05-07T20:23:41.3307644Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:41.3307999Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:41.3308301Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:41.3308613Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:41.3308937Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:41.3309393Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:41.3309698Z depends:        i2c-core,drm
2025-05-07T20:23:41.3309945Z retpoline:      Y
2025-05-07T20:23:41.3310149Z name:           nvidia
2025-05-07T20:23:41.3310503Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:41.3310960Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:41.3311395Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:41.3311805Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:41.3312100Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:41.3312392Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:41.3312689Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:41.3312991Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:41.3313319Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:41.3313806Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:41.3314209Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:41.3314531Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:41.3314816Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:41.3315107Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:41.3315453Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:41.3315883Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:41.3316411Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:41.3316902Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.3317293Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:41.3317690Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.3318086Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:41.3318412Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:41.3318765Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:41.3319231Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:41.3319565Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:41.3319876Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:41.3320190Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:41.3320501Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:41.3320800Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:41.3321131Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:41.3321479Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:41.3321800Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:41.3322117Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:41.3322454Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:41.3322778Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:41.3323116Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:41.3323437Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:41.3323721Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:41.3324039Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:41.3324344Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:41.3324645Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:41.3324964Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:41.3325299Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:41.3325692Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:41.3326004Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:41.3326337Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:41.3326667Z parm:           rm_firmware_active:charp
2025-05-07T20:23:41.3326931Z + set +e
2025-05-07T20:23:41.3327123Z + nvidia-smi
2025-05-07T20:23:42.7335099Z Wed May  7 20:23:42 2025
2025-05-07T20:23:42.7335498Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.7336363Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:42.7336850Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.7337332Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.7337846Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:42.7338255Z |                                         |                        |               MIG M. |
2025-05-07T20:23:42.7338584Z |=========================================+========================+======================|
2025-05-07T20:23:42.7399026Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:42.7399479Z |  0%   28C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:42.7399869Z |                                         |                        |                  N/A |
2025-05-07T20:23:42.7400248Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.7400636Z
2025-05-07T20:23:42.7401026Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.7401442Z | Processes:                                                                              |
2025-05-07T20:23:42.7401882Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:42.7402282Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:42.7402617Z |=========================================================================================|
2025-05-07T20:23:42.7403827Z |  No running processes found                                                             |
2025-05-07T20:23:42.7404480Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.1486845Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.5566654Z NVIDIA A10G
2025-05-07T20:23:44.8318498Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.8318858Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.8319207Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.8319614Z + set -e
2025-05-07T20:23:44.8319809Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.8328583Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.8331664Z + sudo yum install -y yum-utils
2025-05-07T20:23:45.2423230Z Last metadata expiration check: 0:06:17 ago on Wed May  7 20:17:28 2025.
2025-05-07T20:23:45.2677959Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:45.3068069Z Dependencies resolved.
2025-05-07T20:23:45.3247619Z Nothing to do.
2025-05-07T20:23:45.3247939Z Complete!
2025-05-07T20:23:45.3638848Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:45.3639562Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.3640649Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7146343Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7693936Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:46.3677970Z nvidia-container-toolkit                         12 kB/s | 833  B     00:00
2025-05-07T20:23:46.3924030Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.4321888Z Dependencies resolved.
2025-05-07T20:23:46.4499610Z ================================================================================
2025-05-07T20:23:46.4500383Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:46.4500767Z ================================================================================
2025-05-07T20:23:46.4511542Z Downgrading:
2025-05-07T20:23:46.4511954Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:46.4512539Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:46.4512885Z
2025-05-07T20:23:46.4512975Z Transaction Summary
2025-05-07T20:23:46.4513223Z ================================================================================
2025-05-07T20:23:46.4513535Z Downgrade  2 Packages
2025-05-07T20:23:46.4513680Z
2025-05-07T20:23:46.4513787Z Total download size: 6.8 M
2025-05-07T20:23:46.4514045Z Downloading Packages:
2025-05-07T20:23:46.4932432Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  30 MB/s | 1.2 MB     00:00
2025-05-07T20:23:46.5445228Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  60 MB/s | 5.6 MB     00:00
2025-05-07T20:23:46.5453640Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.5456877Z Total                                            72 MB/s | 6.8 MB     00:00
2025-05-07T20:23:46.5459059Z Running transaction check
2025-05-07T20:23:46.5560014Z Transaction check succeeded.
2025-05-07T20:23:46.5560618Z Running transaction test
2025-05-07T20:23:46.5852278Z Transaction test succeeded.
2025-05-07T20:23:46.5854809Z Running transaction
2025-05-07T20:23:47.1315758Z   Preparing        :                                                        1/1
2025-05-07T20:23:47.2367717Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:47.2395057Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2601773Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2602340Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.2706058Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.2732740Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.4514586Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:47.4515164Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:47.4515689Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:47.4516203Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:47.5910183Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:47.5911204Z WARNING:
2025-05-07T20:23:47.5911556Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.5911965Z
2025-05-07T20:23:47.5912087Z   Available Versions:
2025-05-07T20:23:47.5912292Z
2025-05-07T20:23:47.5912428Z   Version 2023.7.20250331:
2025-05-07T20:23:47.5912747Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.5912996Z
2025-05-07T20:23:47.5913116Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.5913321Z
2025-05-07T20:23:47.5913412Z     Release notes:
2025-05-07T20:23:47.5913812Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.5914193Z
2025-05-07T20:23:47.5914280Z   Version 2023.7.20250414:
2025-05-07T20:23:47.5914575Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.5914821Z
2025-05-07T20:23:47.5914937Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.5915136Z
2025-05-07T20:23:47.5915218Z     Release notes:
2025-05-07T20:23:47.5915603Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.5916272Z
2025-05-07T20:23:47.5916360Z   Version 2023.7.20250428:
2025-05-07T20:23:47.5916658Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.5916896Z
2025-05-07T20:23:47.5917008Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.5917219Z
2025-05-07T20:23:47.5917301Z     Release notes:
2025-05-07T20:23:47.5917677Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.5918043Z
2025-05-07T20:23:47.5918159Z ================================================================================
2025-05-07T20:23:47.6262415Z
2025-05-07T20:23:47.6262615Z
2025-05-07T20:23:47.6262705Z Downgraded:
2025-05-07T20:23:47.6263057Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.6263624Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.6263966Z
2025-05-07T20:23:47.6264045Z Complete!
2025-05-07T20:23:47.6718490Z + sudo systemctl restart docker
2025-05-07T20:23:51.1286891Z Wed May  7 20:23:51 2025
2025-05-07T20:23:51.1287366Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.1287862Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:51.1288343Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.1288822Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:51.1289346Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:51.1289769Z |                                         |                        |               MIG M. |
2025-05-07T20:23:51.1290096Z |=========================================+========================+======================|
2025-05-07T20:23:51.1373422Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:51.1374225Z |  0%   28C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:51.1374616Z |                                         |                        |                  N/A |
2025-05-07T20:23:51.1374998Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.1375380Z
2025-05-07T20:23:51.1375874Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.1376293Z | Processes:                                                                              |
2025-05-07T20:23:51.1376729Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:51.1377128Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:51.1377462Z |=========================================================================================|
2025-05-07T20:23:51.1379033Z |  No running processes found                                                             |
2025-05-07T20:23:51.1379504Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.9517330Z Command completed after 1 attempt(s).
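The durable output of the step above is the GPU_FLAG line the script appended to "${GITHUB_ENV}"; subsequent steps in this job see it as an ordinary environment variable (it appears in the env: block below). A sketch of how a later containerized step might consume it; the CUDA image tag here is an assumption, not taken from this log:

    # GPU_FLAG='--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all' at this point.
    # Left unquoted on purpose so it word-splits into separate docker arguments.
    docker run --rm ${GPU_FLAG} nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi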
2025-05-07T20:23:52.2997379Z + printenv 2025-05-07T20:23:52.2997493Z 2025-05-07T20:23:52.3019380Z SHELL=/bin/bash 2025-05-07T20:23:52.3019912Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:52.3020794Z BUILD_VARIANT=cuda 2025-05-07T20:23:52.3022218Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3023734Z GITHUB_ACTION=__run 2025-05-07T20:23:52.3024310Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:52.3024972Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:52.3025441Z RUNNER_NAME=i-061cb0426579ace80 2025-05-07T20:23:52.3025961Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:52.3026532Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:52.3027032Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:52.3027908Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:52.3028713Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:52.3029236Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:52.3029788Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:52.3030228Z *** 2025-05-07T20:23:52.3030403Z LOGNAME=ec2-user 2025-05-07T20:23:52.3030628Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:52.3030883Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:52.3031107Z GITHUB_ACTIONS=true 2025-05-07T20:23:52.3031320Z SYSTEMD_EXEC_PID=55553 2025-05-07T20:23:52.3031586Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:52.3032107Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:52.3032607Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:52.3032872Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:52.3033117Z RUNNER_OS=Linux 2025-05-07T20:23:52.3033322Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:52.3033556Z HOME=/home/ec2-user 2025-05-07T20:23:52.3033802Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:52.3034081Z LANG=C.UTF-8 2025-05-07T20:23:52.3034371Z RUNNER_TRACKING_ID=github_442cc8eb-c0ed-4196-93e2-da6db7d8b0a7 2025-05-07T20:23:52.3034720Z RUNNER_ARCH=X64 2025-05-07T20:23:52.3034977Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:52.3035293Z BUILD_TARGET=genai 2025-05-07T20:23:52.3035803Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3036647Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3037359Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:52.3038245Z INVOCATION_ID=a688399bf0f247da91f959ddef8510d2 2025-05-07T20:23:52.3038697Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:52.3039034Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:52.3039748Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3040612Z BUILD_ENV=build_binary 2025-05-07T20:23:52.3040824Z GITHUB_ACTOR=q10 2025-05-07T20:23:52.3041084Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:52.3041390Z KERN_NAME_LC=linux 2025-05-07T20:23:52.3041672Z BUILD_CUDA_VERSION=12.8.0 2025-05-07T20:23:52.3042064Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:52.3042768Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:52.3043108Z USER=ec2-user 2025-05-07T20:23:52.3043369Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:52.3043634Z SHLVL=1 2025-05-07T20:23:52.3043818Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:52.3044109Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:52.3044541Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:52.3044894Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:52.3045117Z KERN_NAME=Linux 2025-05-07T20:23:52.3045338Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:52.3045732Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:52.3046139Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:52.3046401Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:52.3046632Z JOURNAL_STREAM=8:96283 2025-05-07T20:23:52.3046927Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:52.3047285Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:52.3047592Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:52.3047919Z GITHUB_BASE_REF=main 2025-05-07T20:23:52.3048121Z CI=true 2025-05-07T20:23:52.3048323Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:52.3048597Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:52.3048852Z GITHUB_ACTION_REF= 2025-05-07T20:23:52.3049089Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:52.3049708Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3050293Z MACHINE_NAME=x86_64 2025-05-07T20:23:52.3050505Z _=/usr/bin/printenv 2025-05-07T20:23:52.3050632Z 2025-05-07T20:23:52.3050750Z ################################################################################ 2025-05-07T20:23:52.3051045Z [INFO] Print ldd version ... 2025-05-07T20:23:52.3051287Z + ldd --version 2025-05-07T20:23:52.3051415Z 2025-05-07T20:23:52.3051512Z ldd (GNU libc) 2.34 2025-05-07T20:23:52.3051779Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:52.3052200Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:52.3052715Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:52.3053148Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:52.3053357Z 2025-05-07T20:23:52.3053479Z ################################################################################ 2025-05-07T20:23:52.3053773Z [INFO] Print CPU info ... 
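[NOTE] nproc and lscpu below report 16 logical CPUs (8 cores, 2 threads per core). A typical use for this figure in later build steps is deriving a parallel-job count; a minimal sketch with illustrative variable names (the actual setup_env.bash logic is not shown in this log):

  # Derive a parallelism level from the visible CPU count, keeping one
  # CPU free for the rest of the system.
  core_count=$(nproc)
  build_jobs=$(( core_count > 1 ? core_count - 1 : 1 ))
  echo "[INFO] Using ${build_jobs} parallel jobs on ${core_count} logical CPUs"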
2025-05-07T20:23:52.3054004Z + nproc 2025-05-07T20:23:52.3054105Z 2025-05-07T20:23:52.3063959Z 16 2025-05-07T20:23:52.3065637Z 2025-05-07T20:23:52.3065936Z + lscpu 2025-05-07T20:23:52.3066090Z 2025-05-07T20:23:52.3182818Z Architecture: x86_64 2025-05-07T20:23:52.3183647Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:52.3184417Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3185157Z Byte Order: Little Endian 2025-05-07T20:23:52.3185804Z CPU(s): 16 2025-05-07T20:23:52.3186361Z On-line CPU(s) list: 0-15 2025-05-07T20:23:52.3186967Z Vendor ID: AuthenticAMD 2025-05-07T20:23:52.3187753Z Model name: AMD EPYC 7R32 2025-05-07T20:23:52.3188355Z CPU family: 23 2025-05-07T20:23:52.3189255Z Model: 49 2025-05-07T20:23:52.3189801Z Thread(s) per core: 2 2025-05-07T20:23:52.3190357Z Core(s) per socket: 8 2025-05-07T20:23:52.3190639Z Socket(s): 1 2025-05-07T20:23:52.3190901Z Stepping: 0 2025-05-07T20:23:52.3191194Z BogoMIPS: 5598.98 2025-05-07T20:23:52.3193210Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3195357Z Hypervisor vendor: KVM 2025-05-07T20:23:52.3195658Z Virtualization type: full 2025-05-07T20:23:52.3195986Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:52.3196335Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:52.3196686Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:52.3197035Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:52.3197341Z NUMA node(s): 1 2025-05-07T20:23:52.3197628Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:52.3197958Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:52.3198318Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:52.3198666Z Vulnerability L1tf: Not affected 2025-05-07T20:23:52.3199011Z Vulnerability Mds: Not affected 2025-05-07T20:23:52.3199361Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:52.3199707Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:52.3200106Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:52.3200629Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:52.3201186Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:52.3201717Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:52.3202395Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:52.3203237Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:52.3203884Z Vulnerability Srbds: Not affected 2025-05-07T20:23:52.3204238Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:52.3204553Z 2025-05-07T20:23:52.3204642Z + cat /proc/cpuinfo 2025-05-07T20:23:52.3204772Z 2025-05-07T20:23:52.3204858Z processor : 0 2025-05-07T20:23:52.3205069Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3205306Z cpu family : 23 2025-05-07T20:23:52.3205506Z model : 49 
2025-05-07T20:23:52.3205704Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3205943Z stepping : 0 2025-05-07T20:23:52.3206144Z microcode : 0x830107f 2025-05-07T20:23:52.3206360Z cpu MHz : 3314.754 2025-05-07T20:23:52.3206568Z cache size : 512 KB 2025-05-07T20:23:52.3206777Z physical id : 0 2025-05-07T20:23:52.3206979Z siblings : 16 2025-05-07T20:23:52.3207178Z core id : 0 2025-05-07T20:23:52.3207372Z cpu cores : 8 2025-05-07T20:23:52.3207558Z apicid : 0 2025-05-07T20:23:52.3207754Z initial apicid : 0 2025-05-07T20:23:52.3207962Z fpu : yes 2025-05-07T20:23:52.3208153Z fpu_exception : yes 2025-05-07T20:23:52.3208366Z cpuid level : 13 2025-05-07T20:23:52.3208571Z wp : yes 2025-05-07T20:23:52.3210589Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3212836Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3213307Z bogomips : 5598.98 2025-05-07T20:23:52.3213525Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3213760Z clflush size : 64 2025-05-07T20:23:52.3213967Z cache_alignment : 64 2025-05-07T20:23:52.3214235Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3214558Z power management: 2025-05-07T20:23:52.3214687Z 2025-05-07T20:23:52.3214767Z processor : 1 2025-05-07T20:23:52.3214981Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3215218Z cpu family : 23 2025-05-07T20:23:52.3215417Z model : 49 2025-05-07T20:23:52.3215621Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3215861Z stepping : 0 2025-05-07T20:23:52.3216057Z microcode : 0x830107f 2025-05-07T20:23:52.3216280Z cpu MHz : 2095.023 2025-05-07T20:23:52.3216489Z cache size : 512 KB 2025-05-07T20:23:52.3216693Z physical id : 0 2025-05-07T20:23:52.3216902Z siblings : 16 2025-05-07T20:23:52.3217095Z core id : 1 2025-05-07T20:23:52.3217286Z cpu cores : 8 2025-05-07T20:23:52.3217477Z apicid : 2 2025-05-07T20:23:52.3217672Z initial apicid : 2 2025-05-07T20:23:52.3217877Z fpu : yes 2025-05-07T20:23:52.3218065Z fpu_exception : yes 2025-05-07T20:23:52.3218277Z cpuid level : 13 2025-05-07T20:23:52.3218477Z wp : yes 2025-05-07T20:23:52.3220424Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3222604Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3223085Z bogomips : 5598.98 2025-05-07T20:23:52.3223301Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3223527Z clflush size : 64 
2025-05-07T20:23:52.3223742Z cache_alignment : 64 2025-05-07T20:23:52.3224005Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3224309Z power management: 2025-05-07T20:23:52.3224442Z 2025-05-07T20:23:52.3224527Z processor : 2 2025-05-07T20:23:52.3224742Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3224977Z cpu family : 23 2025-05-07T20:23:52.3225170Z model : 49 2025-05-07T20:23:52.3225368Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3225599Z stepping : 0 2025-05-07T20:23:52.3225798Z microcode : 0x830107f 2025-05-07T20:23:52.3226019Z cpu MHz : 2018.362 2025-05-07T20:23:52.3226228Z cache size : 512 KB 2025-05-07T20:23:52.3226446Z physical id : 0 2025-05-07T20:23:52.3226651Z siblings : 16 2025-05-07T20:23:52.3226850Z core id : 2 2025-05-07T20:23:52.3227036Z cpu cores : 8 2025-05-07T20:23:52.3227233Z apicid : 4 2025-05-07T20:23:52.3227430Z initial apicid : 4 2025-05-07T20:23:52.3227732Z fpu : yes 2025-05-07T20:23:52.3275008Z fpu_exception : yes 2025-05-07T20:23:52.3275288Z cpuid level : 13 2025-05-07T20:23:52.3275523Z wp : yes 2025-05-07T20:23:52.3277766Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3279962Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3280428Z bogomips : 5598.98 2025-05-07T20:23:52.3280749Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3280978Z clflush size : 64 2025-05-07T20:23:52.3281191Z cache_alignment : 64 2025-05-07T20:23:52.3281448Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3281749Z power management: 2025-05-07T20:23:52.3281877Z 2025-05-07T20:23:52.3281960Z processor : 3 2025-05-07T20:23:52.3282158Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3282393Z cpu family : 23 2025-05-07T20:23:52.3282597Z model : 49 2025-05-07T20:23:52.3282793Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3283015Z stepping : 0 2025-05-07T20:23:52.3283210Z microcode : 0x830107f 2025-05-07T20:23:52.3283427Z cpu MHz : 3299.002 2025-05-07T20:23:52.3283623Z cache size : 512 KB 2025-05-07T20:23:52.3283828Z physical id : 0 2025-05-07T20:23:52.3284028Z siblings : 16 2025-05-07T20:23:52.3284214Z core id : 3 2025-05-07T20:23:52.3284402Z cpu cores : 8 2025-05-07T20:23:52.3284588Z apicid : 6 2025-05-07T20:23:52.3284769Z initial apicid : 6 2025-05-07T20:23:52.3284980Z fpu : yes 2025-05-07T20:23:52.3285168Z fpu_exception : yes 2025-05-07T20:23:52.3285370Z cpuid level : 13 2025-05-07T20:23:52.3285562Z wp : yes 2025-05-07T20:23:52.3287596Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3289744Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3290254Z bogomips : 5598.98 2025-05-07T20:23:52.3290463Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3290686Z clflush size : 64 2025-05-07T20:23:52.3290883Z cache_alignment : 64 2025-05-07T20:23:52.3291141Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3291440Z power management: 2025-05-07T20:23:52.3291564Z 2025-05-07T20:23:52.3291647Z processor : 4 2025-05-07T20:23:52.3291844Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3292065Z cpu family : 23 2025-05-07T20:23:52.3292261Z model : 49 2025-05-07T20:23:52.3292454Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3292684Z stepping : 0 2025-05-07T20:23:52.3292884Z microcode : 0x830107f 2025-05-07T20:23:52.3293090Z cpu MHz : 3022.864 2025-05-07T20:23:52.3293294Z cache size : 512 KB 2025-05-07T20:23:52.3293499Z physical id : 0 2025-05-07T20:23:52.3293691Z siblings : 16 2025-05-07T20:23:52.3293880Z core id : 4 2025-05-07T20:23:52.3294066Z cpu cores : 8 2025-05-07T20:23:52.3294249Z apicid : 8 2025-05-07T20:23:52.3294432Z initial apicid : 8 2025-05-07T20:23:52.3294637Z fpu : yes 2025-05-07T20:23:52.3294876Z fpu_exception : yes 2025-05-07T20:23:52.3295089Z cpuid level : 13 2025-05-07T20:23:52.3295289Z wp : yes 2025-05-07T20:23:52.3297260Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3299415Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3299887Z bogomips : 5598.98 2025-05-07T20:23:52.3300094Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3300316Z clflush size : 64 2025-05-07T20:23:52.3300518Z cache_alignment : 64 2025-05-07T20:23:52.3300926Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3301224Z power management: 2025-05-07T20:23:52.3301351Z 2025-05-07T20:23:52.3301432Z processor : 5 2025-05-07T20:23:52.3301635Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3301860Z cpu family : 23 2025-05-07T20:23:52.3302045Z model : 49 2025-05-07T20:23:52.3302234Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3302467Z stepping : 0 2025-05-07T20:23:52.3302660Z microcode : 0x830107f 2025-05-07T20:23:52.3302874Z cpu MHz : 2590.152 2025-05-07T20:23:52.3303076Z cache size : 512 KB 2025-05-07T20:23:52.3303276Z physical id : 0 2025-05-07T20:23:52.3303475Z siblings : 16 2025-05-07T20:23:52.3303661Z core id : 5 2025-05-07T20:23:52.3303838Z cpu cores : 8 2025-05-07T20:23:52.3304023Z apicid : 10 2025-05-07T20:23:52.3304215Z initial apicid : 10 2025-05-07T20:23:52.3304412Z fpu : yes 2025-05-07T20:23:52.3304590Z fpu_exception : yes 2025-05-07T20:23:52.3304793Z cpuid level : 13 2025-05-07T20:23:52.3304989Z wp : yes 2025-05-07T20:23:52.3306869Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3309095Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3309564Z bogomips : 5598.98 2025-05-07T20:23:52.3309775Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3309992Z clflush size : 64 2025-05-07T20:23:52.3310203Z cache_alignment : 64 2025-05-07T20:23:52.3310461Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3310758Z power management: 2025-05-07T20:23:52.3310891Z 2025-05-07T20:23:52.3310967Z processor : 6 2025-05-07T20:23:52.3311173Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3311399Z cpu family : 23 2025-05-07T20:23:52.3311587Z model : 49 2025-05-07T20:23:52.3311778Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3312004Z stepping : 0 2025-05-07T20:23:52.3312191Z microcode : 0x830107f 2025-05-07T20:23:52.3312405Z cpu MHz : 2021.248 2025-05-07T20:23:52.3312607Z cache size : 512 KB 2025-05-07T20:23:52.3312803Z physical id : 0 2025-05-07T20:23:52.3312994Z siblings : 16 2025-05-07T20:23:52.3313184Z core id : 6 2025-05-07T20:23:52.3313358Z cpu cores : 8 2025-05-07T20:23:52.3313550Z apicid : 12 2025-05-07T20:23:52.3313742Z initial apicid : 12 2025-05-07T20:23:52.3313936Z fpu : yes 2025-05-07T20:23:52.3314119Z fpu_exception : yes 2025-05-07T20:23:52.3314323Z cpuid level : 13 2025-05-07T20:23:52.3314514Z wp : yes 2025-05-07T20:23:52.3316479Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3318652Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3319121Z bogomips : 5598.98 2025-05-07T20:23:52.3319321Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3319544Z clflush size : 64 2025-05-07T20:23:52.3319748Z cache_alignment : 64 2025-05-07T20:23:52.3320007Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3320297Z power management: 2025-05-07T20:23:52.3320509Z 2025-05-07T20:23:52.3320587Z processor : 7 2025-05-07T20:23:52.3320794Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3321015Z cpu family : 23 2025-05-07T20:23:52.3321210Z model : 49 2025-05-07T20:23:52.3321402Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3321622Z stepping : 0 2025-05-07T20:23:52.3321813Z microcode : 0x830107f 2025-05-07T20:23:52.3322020Z cpu MHz : 3298.161 2025-05-07T20:23:52.3322220Z cache size : 512 KB 2025-05-07T20:23:52.3322425Z physical id : 0 2025-05-07T20:23:52.3322619Z siblings : 16 2025-05-07T20:23:52.3322801Z core id : 7 2025-05-07T20:23:52.3322988Z cpu cores : 8 2025-05-07T20:23:52.3323173Z apicid : 
14 2025-05-07T20:23:52.3323356Z initial apicid : 14 2025-05-07T20:23:52.3323554Z fpu : yes 2025-05-07T20:23:52.3323739Z fpu_exception : yes 2025-05-07T20:23:52.3323938Z cpuid level : 13 2025-05-07T20:23:52.3324138Z wp : yes 2025-05-07T20:23:52.3326026Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3328232Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3328698Z bogomips : 5598.98 2025-05-07T20:23:52.3328900Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3329119Z clflush size : 64 2025-05-07T20:23:52.3329323Z cache_alignment : 64 2025-05-07T20:23:52.3329576Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3329871Z power management: 2025-05-07T20:23:52.3329998Z 2025-05-07T20:23:52.3330078Z processor : 8 2025-05-07T20:23:52.3330273Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3330504Z cpu family : 23 2025-05-07T20:23:52.3330701Z model : 49 2025-05-07T20:23:52.3330890Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3331117Z stepping : 0 2025-05-07T20:23:52.3331317Z microcode : 0x830107f 2025-05-07T20:23:52.3331525Z cpu MHz : 2903.663 2025-05-07T20:23:52.3331729Z cache size : 512 KB 2025-05-07T20:23:52.3331928Z physical id : 0 2025-05-07T20:23:52.3332122Z siblings : 16 2025-05-07T20:23:52.3332312Z core id : 0 2025-05-07T20:23:52.3332495Z cpu cores : 8 2025-05-07T20:23:52.3332682Z apicid : 1 2025-05-07T20:23:52.3332861Z initial apicid : 1 2025-05-07T20:23:52.3333061Z fpu : yes 2025-05-07T20:23:52.3333244Z fpu_exception : yes 2025-05-07T20:23:52.3333442Z cpuid level : 13 2025-05-07T20:23:52.3333650Z wp : yes 2025-05-07T20:23:52.3335532Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3337793Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3338285Z bogomips : 5598.98 2025-05-07T20:23:52.3338490Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3338703Z clflush size : 64 2025-05-07T20:23:52.3338909Z cache_alignment : 64 2025-05-07T20:23:52.3339163Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3339453Z power management: 2025-05-07T20:23:52.3339584Z 2025-05-07T20:23:52.3339662Z processor : 9 2025-05-07T20:23:52.3339857Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3340337Z cpu family : 23 2025-05-07T20:23:52.3340818Z model : 49 2025-05-07T20:23:52.3341106Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3341419Z 
stepping : 0 2025-05-07T20:23:52.3341614Z microcode : 0x830107f 2025-05-07T20:23:52.3341825Z cpu MHz : 3119.159 2025-05-07T20:23:52.3342020Z cache size : 512 KB 2025-05-07T20:23:52.3342224Z physical id : 0 2025-05-07T20:23:52.3342419Z siblings : 16 2025-05-07T20:23:52.3342604Z core id : 1 2025-05-07T20:23:52.3342790Z cpu cores : 8 2025-05-07T20:23:52.3342974Z apicid : 3 2025-05-07T20:23:52.3343164Z initial apicid : 3 2025-05-07T20:23:52.3343360Z fpu : yes 2025-05-07T20:23:52.3343537Z fpu_exception : yes 2025-05-07T20:23:52.3343739Z cpuid level : 13 2025-05-07T20:23:52.3343930Z wp : yes 2025-05-07T20:23:52.3345941Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3348202Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3348663Z bogomips : 5598.98 2025-05-07T20:23:52.3348871Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3349098Z clflush size : 64 2025-05-07T20:23:52.3349302Z cache_alignment : 64 2025-05-07T20:23:52.3349566Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3349893Z power management: 2025-05-07T20:23:52.3350040Z 2025-05-07T20:23:52.3350120Z processor : 10 2025-05-07T20:23:52.3350334Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3350562Z cpu family : 23 2025-05-07T20:23:52.3350754Z model : 49 2025-05-07T20:23:52.3350952Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3351188Z stepping : 0 2025-05-07T20:23:52.3351376Z microcode : 0x830107f 2025-05-07T20:23:52.3351591Z cpu MHz : 3237.299 2025-05-07T20:23:52.3351793Z cache size : 512 KB 2025-05-07T20:23:52.3351997Z physical id : 0 2025-05-07T20:23:52.3352196Z siblings : 16 2025-05-07T20:23:52.3352386Z core id : 2 2025-05-07T20:23:52.3352567Z cpu cores : 8 2025-05-07T20:23:52.3352756Z apicid : 5 2025-05-07T20:23:52.3352944Z initial apicid : 5 2025-05-07T20:23:52.3353136Z fpu : yes 2025-05-07T20:23:52.3353323Z fpu_exception : yes 2025-05-07T20:23:52.3353532Z cpuid level : 13 2025-05-07T20:23:52.3353731Z wp : yes 2025-05-07T20:23:52.3355602Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3357756Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3358222Z bogomips : 5598.98 2025-05-07T20:23:52.3358554Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3358771Z clflush size : 64 2025-05-07T20:23:52.3358981Z cache_alignment : 64 2025-05-07T20:23:52.3359238Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:52.3359531Z power management: 2025-05-07T20:23:52.3359662Z 2025-05-07T20:23:52.3359742Z processor : 11 2025-05-07T20:23:52.3359951Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3360175Z cpu family : 23 2025-05-07T20:23:52.3360365Z model : 49 2025-05-07T20:23:52.3360560Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3360788Z stepping : 0 2025-05-07T20:23:52.3361046Z microcode : 0x830107f 2025-05-07T20:23:52.3361259Z cpu MHz : 3237.805 2025-05-07T20:23:52.3361466Z cache size : 512 KB 2025-05-07T20:23:52.3361665Z physical id : 0 2025-05-07T20:23:52.3361866Z siblings : 16 2025-05-07T20:23:52.3362061Z core id : 3 2025-05-07T20:23:52.3362244Z cpu cores : 8 2025-05-07T20:23:52.3362489Z apicid : 7 2025-05-07T20:23:52.3362761Z initial apicid : 7 2025-05-07T20:23:52.3362968Z fpu : yes 2025-05-07T20:23:52.3363159Z fpu_exception : yes 2025-05-07T20:23:52.3363366Z cpuid level : 13 2025-05-07T20:23:52.3363558Z wp : yes 2025-05-07T20:23:52.3365554Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3367731Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3368201Z bogomips : 5598.98 2025-05-07T20:23:52.3368406Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3368635Z clflush size : 64 2025-05-07T20:23:52.3368841Z cache_alignment : 64 2025-05-07T20:23:52.3369099Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3369391Z power management: 2025-05-07T20:23:52.3369527Z 2025-05-07T20:23:52.3369607Z processor : 12 2025-05-07T20:23:52.3369810Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3370027Z cpu family : 23 2025-05-07T20:23:52.3370221Z model : 49 2025-05-07T20:23:52.3370412Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3370630Z stepping : 0 2025-05-07T20:23:52.3370826Z microcode : 0x830107f 2025-05-07T20:23:52.3371037Z cpu MHz : 3102.093 2025-05-07T20:23:52.3371241Z cache size : 512 KB 2025-05-07T20:23:52.3371446Z physical id : 0 2025-05-07T20:23:52.3371644Z siblings : 16 2025-05-07T20:23:52.3371825Z core id : 4 2025-05-07T20:23:52.3372014Z cpu cores : 8 2025-05-07T20:23:52.3372203Z apicid : 9 2025-05-07T20:23:52.3372382Z initial apicid : 9 2025-05-07T20:23:52.3372590Z fpu : yes 2025-05-07T20:23:52.3372775Z fpu_exception : yes 2025-05-07T20:23:52.3372978Z cpuid level : 13 2025-05-07T20:23:52.3373207Z wp : yes 2025-05-07T20:23:52.3375361Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:52.3377525Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3377994Z bogomips : 5598.98 2025-05-07T20:23:52.3378200Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3378426Z clflush size : 64 2025-05-07T20:23:52.3378631Z cache_alignment : 64 2025-05-07T20:23:52.3378992Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3379297Z power management: 2025-05-07T20:23:52.3379421Z 2025-05-07T20:23:52.3379510Z processor : 13 2025-05-07T20:23:52.3379713Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3379941Z cpu family : 23 2025-05-07T20:23:52.3380141Z model : 49 2025-05-07T20:23:52.3380332Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3380563Z stepping : 0 2025-05-07T20:23:52.3380762Z microcode : 0x830107f 2025-05-07T20:23:52.3380975Z cpu MHz : 2431.197 2025-05-07T20:23:52.3381184Z cache size : 512 KB 2025-05-07T20:23:52.3381393Z physical id : 0 2025-05-07T20:23:52.3381666Z siblings : 16 2025-05-07T20:23:52.3381859Z core id : 5 2025-05-07T20:23:52.3382052Z cpu cores : 8 2025-05-07T20:23:52.3382237Z apicid : 11 2025-05-07T20:23:52.3382433Z initial apicid : 11 2025-05-07T20:23:52.3382638Z fpu : yes 2025-05-07T20:23:52.3382825Z fpu_exception : yes 2025-05-07T20:23:52.3383028Z cpuid level : 13 2025-05-07T20:23:52.3383229Z wp : yes 2025-05-07T20:23:52.3385120Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3387292Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3387839Z bogomips : 5598.98 2025-05-07T20:23:52.3388049Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3388274Z clflush size : 64 2025-05-07T20:23:52.3388474Z cache_alignment : 64 2025-05-07T20:23:52.3388734Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3389037Z power management: 2025-05-07T20:23:52.3389162Z 2025-05-07T20:23:52.3389240Z processor : 14 2025-05-07T20:23:52.3389452Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3389680Z cpu family : 23 2025-05-07T20:23:52.3389877Z model : 49 2025-05-07T20:23:52.3390066Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3390296Z stepping : 0 2025-05-07T20:23:52.3390492Z microcode : 0x830107f 2025-05-07T20:23:52.3390704Z cpu MHz : 3064.028 2025-05-07T20:23:52.3390913Z cache size : 512 KB 2025-05-07T20:23:52.3391119Z physical id : 0 2025-05-07T20:23:52.3391314Z siblings : 16 2025-05-07T20:23:52.3391513Z core id : 6 2025-05-07T20:23:52.3391702Z cpu cores : 8 2025-05-07T20:23:52.3391887Z apicid : 13 2025-05-07T20:23:52.3392082Z initial apicid : 13 2025-05-07T20:23:52.3392284Z fpu : yes 2025-05-07T20:23:52.3392465Z fpu_exception : yes 2025-05-07T20:23:52.3392672Z cpuid level : 13 2025-05-07T20:23:52.3392875Z wp : yes 2025-05-07T20:23:52.3394760Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3396909Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3397385Z bogomips : 5598.98 2025-05-07T20:23:52.3397597Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3397819Z clflush size : 64 2025-05-07T20:23:52.3398028Z cache_alignment : 64 2025-05-07T20:23:52.3398293Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3398600Z power management: 2025-05-07T20:23:52.3398728Z 2025-05-07T20:23:52.3398895Z processor : 15 2025-05-07T20:23:52.3399115Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3399345Z cpu family : 23 2025-05-07T20:23:52.3399534Z model : 49 2025-05-07T20:23:52.3399737Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3399968Z stepping : 0 2025-05-07T20:23:52.3400161Z microcode : 0x830107f 2025-05-07T20:23:52.3400377Z cpu MHz : 3282.272 2025-05-07T20:23:52.3400587Z cache size : 512 KB 2025-05-07T20:23:52.3400786Z physical id : 0 2025-05-07T20:23:52.3400990Z siblings : 16 2025-05-07T20:23:52.3401186Z core id : 7 2025-05-07T20:23:52.3401372Z cpu cores : 8 2025-05-07T20:23:52.3401635Z apicid : 15 2025-05-07T20:23:52.3401830Z initial apicid : 15 2025-05-07T20:23:52.3402028Z fpu : yes 2025-05-07T20:23:52.3402215Z fpu_exception : yes 2025-05-07T20:23:52.3402423Z cpuid level : 13 2025-05-07T20:23:52.3402615Z wp : yes 2025-05-07T20:23:52.3404502Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3406665Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3407138Z bogomips : 5598.98 2025-05-07T20:23:52.3407348Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3407565Z clflush size : 64 2025-05-07T20:23:52.3407775Z cache_alignment : 64 2025-05-07T20:23:52.3408039Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3408337Z power management: 2025-05-07T20:23:52.3408469Z 2025-05-07T20:23:52.3408473Z 2025-05-07T20:23:52.3408584Z ################################################################################ 2025-05-07T20:23:52.3408880Z [INFO] Print PCI info ... 2025-05-07T20:23:52.3409108Z + lspci -v 2025-05-07T20:23:52.3409224Z 2025-05-07T20:23:52.3409440Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:52.3409806Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:52.3410115Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:52.3410315Z 2025-05-07T20:23:52.3410512Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:52.3410875Z Physical Slot: 1 2025-05-07T20:23:52.3411114Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3411310Z 2025-05-07T20:23:52.3411555Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:52.3411968Z Physical Slot: 1 2025-05-07T20:23:52.3412218Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:52.3412441Z 2025-05-07T20:23:52.3412701Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:52.3413130Z Physical Slot: 3 2025-05-07T20:23:52.3419645Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3420009Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:52.3420360Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:52.3420574Z 2025-05-07T20:23:52.3420863Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:52.3421353Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:52.3421636Z Physical Slot: 4 2025-05-07T20:23:52.3421881Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:52.3422244Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:52.3422586Z Capabilities: 2025-05-07T20:23:52.3422846Z Kernel driver in use: nvme 2025-05-07T20:23:52.3423001Z 2025-05-07T20:23:52.3423345Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:52.3423809Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:52.3424145Z Physical Slot: 5 2025-05-07T20:23:52.3424377Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3424724Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:52.3425097Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:52.3425403Z Capabilities: 2025-05-07T20:23:52.3425654Z Kernel driver in use: ena 2025-05-07T20:23:52.3425885Z Kernel modules: ena 2025-05-07T20:23:52.3426094Z 2025-05-07T20:23:52.3426262Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:52.3426621Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:52.3426903Z Physical Slot: 30 2025-05-07T20:23:52.3427151Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:52.3427586Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:52.3427998Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:52.3428363Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:52.3428680Z Capabilities: 2025-05-07T20:23:52.3428936Z Kernel driver in use: nvidia 2025-05-07T20:23:52.3429180Z Kernel modules: nvidia 2025-05-07T20:23:52.3429319Z 2025-05-07T20:23:52.3429626Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:52.3430110Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:52.3430384Z Physical Slot: 31 2025-05-07T20:23:52.3430618Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3430956Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:52.3431325Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:52.3431639Z Capabilities: 2025-05-07T20:23:52.3431890Z Kernel driver in use: nvme 2025-05-07T20:23:52.3432047Z 2025-05-07T20:23:52.3432051Z 2025-05-07T20:23:52.3432162Z ################################################################################ 2025-05-07T20:23:52.3432473Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:52.3432747Z + uname -a 2025-05-07T20:23:52.3432851Z 2025-05-07T20:23:52.3433241Z Linux ip-10-0-29-91.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:52.3433721Z 2025-05-07T20:23:52.3433797Z + uname -m 2025-05-07T20:23:52.3433914Z 2025-05-07T20:23:52.3433986Z x86_64 2025-05-07T20:23:52.3434086Z 2025-05-07T20:23:52.3434177Z + cat /proc/version 2025-05-07T20:23:52.3434305Z 2025-05-07T20:23:52.3434826Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:52.3435439Z 2025-05-07T20:23:52.3435524Z + cat /etc/os-release 2025-05-07T20:23:52.3435670Z 2025-05-07T20:23:52.3435754Z NAME="Amazon Linux" 2025-05-07T20:23:52.3435960Z VERSION="2023" 2025-05-07T20:23:52.3436156Z ID="amzn" 2025-05-07T20:23:52.3436332Z ID_LIKE="fedora" 2025-05-07T20:23:52.3436528Z VERSION_ID="2023" 2025-05-07T20:23:52.3436742Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:52.3437009Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:52.3437290Z ANSI_COLOR="0;33" 2025-05-07T20:23:52.3437525Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:52.3437902Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:52.3438327Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:52.3438731Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:52.3439161Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:52.3439521Z VENDOR_NAME="AWS" 2025-05-07T20:23:52.3439755Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:52.3440029Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:52.3440468Z 2025-05-07T20:23:52.3440912Z ################################################################################ 2025-05-07T20:23:52.3441322Z # Print EC2 Instance Info 2025-05-07T20:23:52.3441546Z # 2025-05-07T20:23:52.3441745Z # [2025-05-07T20:23:52.339Z] + print_ec2_info 2025-05-07T20:23:52.3442049Z ################################################################################ 2025-05-07T20:23:52.3442250Z 2025-05-07T20:23:52.3521492Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:52.3636056Z instance-id: i-061cb0426579ace80 2025-05-07T20:23:52.3745412Z instance-type: g5.4xlarge 2025-05-07T20:23:52.3786618Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:52.3787136Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:52.3796141Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:52.3796499Z env: 2025-05-07T20:23:52.3796717Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:52.3797028Z BUILD_ENV: build_binary 2025-05-07T20:23:52.3797277Z BUILD_TARGET: genai 2025-05-07T20:23:52.3797508Z BUILD_VARIANT: cuda 2025-05-07T20:23:52.3797735Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:52.3798002Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:52.3798301Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:52.3798631Z ##[endgroup] 2025-05-07T20:23:52.7138087Z ################################################################################ 2025-05-07T20:23:52.7138469Z [INFO] Printing general display info ... 2025-05-07T20:23:52.7171426Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:52.8328928Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:52.8339624Z /usr/bin/sudo 2025-05-07T20:23:52.8350640Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:52.8361195Z /usr/bin/yum 2025-05-07T20:23:52.8363144Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:52.8385293Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:53.2689085Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:53.3453357Z ================================================================================ 2025-05-07T20:23:53.3453996Z WARNING: 2025-05-07T20:23:53.3454307Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:53.3454603Z 2025-05-07T20:23:53.3454695Z Available Versions: 2025-05-07T20:23:53.3454855Z 2025-05-07T20:23:53.3454978Z Version 2023.7.20250331: 2025-05-07T20:23:53.3455304Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:53.3455553Z 2025-05-07T20:23:53.3455697Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:53.3455908Z 2025-05-07T20:23:53.3455994Z Release notes: 2025-05-07T20:23:53.3456383Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:53.3456761Z 2025-05-07T20:23:53.3456847Z Version 2023.7.20250414: 2025-05-07T20:23:53.3457141Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:53.3457378Z 2025-05-07T20:23:53.3457485Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:53.3457688Z 2025-05-07T20:23:53.3457767Z Release notes: 2025-05-07T20:23:53.3458148Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:53.3458510Z 2025-05-07T20:23:53.3458599Z Version 2023.7.20250428: 2025-05-07T20:23:53.3458883Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:53.3459124Z 2025-05-07T20:23:53.3459232Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:53.3459437Z 2025-05-07T20:23:53.3459532Z Release notes: 2025-05-07T20:23:53.3459910Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:53.3460266Z 2025-05-07T20:23:53.3460372Z ================================================================================ 2025-05-07T20:23:53.4604931Z Dependencies resolved. 
2025-05-07T20:23:53.4891484Z ================================================================================ 2025-05-07T20:23:53.4891973Z Package Arch Version Repository Size 2025-05-07T20:23:53.4892517Z ================================================================================ 2025-05-07T20:23:53.4892814Z Upgrading: 2025-05-07T20:23:53.4893164Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:53.4893727Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:53.4894080Z 2025-05-07T20:23:53.4894432Z Transaction Summary 2025-05-07T20:23:53.4894825Z ================================================================================ 2025-05-07T20:23:53.4895127Z Upgrade 2 Packages 2025-05-07T20:23:53.4895267Z 2025-05-07T20:23:53.4895370Z Total download size: 6.9 M 2025-05-07T20:23:53.4895963Z Downloading Packages: 2025-05-07T20:23:53.5419714Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 24 MB/s | 1.2 MB 00:00 2025-05-07T20:23:53.5683424Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 73 MB/s | 5.7 MB 00:00 2025-05-07T20:23:53.5691576Z -------------------------------------------------------------------------------- 2025-05-07T20:23:53.5694566Z Total 87 MB/s | 6.9 MB 00:00 2025-05-07T20:23:53.5696891Z Running transaction check 2025-05-07T20:23:53.5793593Z Transaction check succeeded. 2025-05-07T20:23:53.5794217Z Running transaction test 2025-05-07T20:23:53.6087966Z Transaction test succeeded. 2025-05-07T20:23:53.6090879Z Running transaction 2025-05-07T20:23:54.1577044Z Preparing : 1/1 2025-05-07T20:23:54.2636159Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:54.2656416Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:54.2864269Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:54.2865117Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:54.2965693Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:54.2987940Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:54.4366580Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:54.4367370Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:54.4368007Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:54.4368547Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:54.5867733Z ================================================================================ 2025-05-07T20:23:54.5868295Z WARNING: 2025-05-07T20:23:54.5868630Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:54.5868937Z 2025-05-07T20:23:54.5869072Z Available Versions: 2025-05-07T20:23:54.5869275Z 2025-05-07T20:23:54.5869372Z Version 2023.7.20250331: 2025-05-07T20:23:54.5869681Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.5869934Z 2025-05-07T20:23:54.5870064Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.5870269Z 2025-05-07T20:23:54.5870360Z Release notes: 2025-05-07T20:23:54.5870751Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.5871128Z 2025-05-07T20:23:54.5871243Z Version 2023.7.20250414: 2025-05-07T20:23:54.5871552Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.5871791Z 2025-05-07T20:23:54.5871899Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.5872107Z 2025-05-07T20:23:54.5872192Z Release notes: 2025-05-07T20:23:54.5872578Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.5872945Z 2025-05-07T20:23:54.5873038Z Version 2023.7.20250428: 2025-05-07T20:23:54.5873328Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.5873569Z 2025-05-07T20:23:54.5873679Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.5873882Z 2025-05-07T20:23:54.5873971Z Release notes: 2025-05-07T20:23:54.5874343Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.5874717Z 2025-05-07T20:23:54.5875151Z ================================================================================ 2025-05-07T20:23:54.6447884Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:54.6448338Z 2025-05-07T20:23:54.6448461Z Upgraded: 2025-05-07T20:23:54.6448904Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:54.6449672Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:54.6450147Z 2025-05-07T20:23:54.6450256Z Complete! 2025-05-07T20:23:54.6887012Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:54.6909548Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:55.1132551Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:55.1375323Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:55.1777261Z Dependencies resolved. 
2025-05-07T20:23:55.1954848Z ================================================================================ 2025-05-07T20:23:55.1955519Z Package Architecture Version Repository Size 2025-05-07T20:23:55.1956089Z ================================================================================ 2025-05-07T20:23:55.1956430Z Installing: 2025-05-07T20:23:55.1956712Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:55.1956972Z 2025-05-07T20:23:55.1957063Z Transaction Summary 2025-05-07T20:23:55.1957333Z ================================================================================ 2025-05-07T20:23:55.1957752Z Install 1 Package 2025-05-07T20:23:55.1957933Z 2025-05-07T20:23:55.1958070Z Total download size: 319 k 2025-05-07T20:23:55.1958393Z Installed size: 837 k 2025-05-07T20:23:55.1959149Z Downloading Packages: 2025-05-07T20:23:55.2775621Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.6 MB/s | 319 kB 00:00 2025-05-07T20:23:55.2781228Z -------------------------------------------------------------------------------- 2025-05-07T20:23:55.2783952Z Total 3.8 MB/s | 319 kB 00:00 2025-05-07T20:23:55.2937344Z Running transaction check 2025-05-07T20:23:55.2992767Z Transaction check succeeded. 2025-05-07T20:23:55.2993144Z Running transaction test 2025-05-07T20:23:55.3454931Z Transaction test succeeded. 2025-05-07T20:23:55.3458760Z Running transaction 2025-05-07T20:23:55.4511503Z Preparing : 1/1 2025-05-07T20:23:55.5048941Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:55.6767657Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:55.8136849Z ================================================================================ 2025-05-07T20:23:55.8137341Z WARNING: 2025-05-07T20:23:55.8137649Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:55.8137956Z 2025-05-07T20:23:55.8138074Z Available Versions: 2025-05-07T20:23:55.8138312Z 2025-05-07T20:23:55.8138414Z Version 2023.7.20250331: 2025-05-07T20:23:55.8138716Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:55.8138966Z 2025-05-07T20:23:55.8139086Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:55.8139298Z 2025-05-07T20:23:55.8139379Z Release notes: 2025-05-07T20:23:55.8139778Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:55.8140430Z 2025-05-07T20:23:55.8140541Z Version 2023.7.20250414: 2025-05-07T20:23:55.8140841Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:55.8141091Z 2025-05-07T20:23:55.8141202Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:55.8141400Z 2025-05-07T20:23:55.8141490Z Release notes: 2025-05-07T20:23:55.8141868Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:55.8142243Z 2025-05-07T20:23:55.8142623Z Version 2023.7.20250428: 2025-05-07T20:23:55.8143056Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:55.8143303Z 2025-05-07T20:23:55.8143419Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:55.8143616Z 2025-05-07T20:23:55.8143701Z Release notes: 2025-05-07T20:23:55.8144091Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:55.8144442Z 2025-05-07T20:23:55.8144566Z ================================================================================ 2025-05-07T20:23:55.8482079Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:55.8482513Z 2025-05-07T20:23:55.8482625Z Installed: 2025-05-07T20:23:55.8483031Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:55.8483418Z 2025-05-07T20:23:55.8483516Z Complete! 2025-05-07T20:23:55.8942813Z + hostname 2025-05-07T20:23:55.8943006Z 2025-05-07T20:23:55.8956139Z ip-10-0-29-91.ec2.internal 2025-05-07T20:23:55.8957133Z 2025-05-07T20:23:55.8957805Z + sudo lshw -C display 2025-05-07T20:23:55.8958006Z 2025-05-07T20:23:56.3239381Z *-display:0 UNCLAIMED 2025-05-07T20:23:56.3239712Z description: VGA compatible controller 2025-05-07T20:23:56.3240023Z product: Amazon.com, Inc. 2025-05-07T20:23:56.3240529Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:56.3240805Z physical id: 3 2025-05-07T20:23:56.3241060Z bus info: pci@0000:00:03.0 2025-05-07T20:23:56.3241311Z version: 00 2025-05-07T20:23:56.3241515Z width: 32 bits 2025-05-07T20:23:56.3241723Z clock: 33MHz 2025-05-07T20:23:56.3241964Z capabilities: vga_controller bus_master 2025-05-07T20:23:56.3242270Z configuration: latency=0 2025-05-07T20:23:56.3242588Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:56.3242904Z *-display:1 2025-05-07T20:23:56.3243121Z description: 3D controller 2025-05-07T20:23:56.3243413Z product: GA102GL [A10G] 2025-05-07T20:23:56.3243665Z vendor: NVIDIA Corporation 2025-05-07T20:23:56.3243929Z physical id: 1e 2025-05-07T20:23:56.3244159Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:56.3244402Z version: a1 2025-05-07T20:23:56.3244608Z width: 64 bits 2025-05-07T20:23:56.3244823Z clock: 33MHz 2025-05-07T20:23:56.3245101Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:56.3245465Z configuration: driver=nvidia latency=0 2025-05-07T20:23:56.3246082Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:56.3279137Z 2025-05-07T20:23:56.3279349Z ################################################################################ 2025-05-07T20:23:56.3279659Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:56.3410000Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:56.3577465Z Wed May 7 20:23:56 2025 2025-05-07T20:23:56.3586244Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:56.3586951Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:56.3587426Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:56.3587989Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:56.3588502Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:56.3588922Z | | | MIG M. | 2025-05-07T20:23:56.3589241Z |=========================================+========================+======================| 2025-05-07T20:23:56.3657426Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:56.3658202Z | 0% 29C P0 61W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:56.3658714Z | | | N/A | 2025-05-07T20:23:56.3659091Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:56.3659476Z 2025-05-07T20:23:56.3660009Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:56.3660431Z | Processes: | 2025-05-07T20:23:56.3660889Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:56.3661302Z | ID ID Usage | 2025-05-07T20:23:56.3661647Z |=========================================================================================| 2025-05-07T20:23:56.3662600Z | No running processes found | 2025-05-07T20:23:56.3663058Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:56.5073529Z ################################################################################ 2025-05-07T20:23:56.5073877Z [INFO] Printing AMD GPU info ... 
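[NOTE] This job runs on an NVIDIA A10G, so the ROCm probes below are expected to fail; the script checks for both vendors regardless of the build variant. A minimal sketch of that probe pattern (the tool list and messages are illustrative, not the actual setup_env.bash code):

  # Probe for each vendor's GPU tooling (the log shows `which` being used
  # for this) and report what is present.
  for tool in nvidia-smi rocminfo rocm-smi; do
    if which "$tool" >/dev/null 2>&1; then
      echo "[CHECK] $tool found at $(which "$tool")"
    else
      echo "[CHECK] $tool not found"
    fi
  done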
2025-05-07T20:23:56.5073529Z ################################################################################
2025-05-07T20:23:56.5073877Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:56.5220507Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:56.5221457Z [CHECK] rocminfo not found
2025-05-07T20:23:56.5230412Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:56.5231342Z [CHECK] rocm-smi not found
2025-05-07T20:23:56.5265910Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:56.5266334Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:56.5279116Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:56.5279474Z env:
2025-05-07T20:23:56.5279694Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:56.5280011Z BUILD_ENV: build_binary
2025-05-07T20:23:56.5280260Z BUILD_TARGET: genai
2025-05-07T20:23:56.5280490Z BUILD_VARIANT: cuda
2025-05-07T20:23:56.5280719Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:56.5280998Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:56.5281318Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:56.5281637Z ##[endgroup]
2025-05-07T20:23:56.8608916Z ################################################################################
2025-05-07T20:23:56.8609275Z # Setup Miniconda
2025-05-07T20:23:56.8609478Z #
2025-05-07T20:23:56.8623894Z # [2025-05-07T20:23:56.862Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:56.8624296Z ################################################################################
2025-05-07T20:23:56.8638946Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:56.9548079Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:56.9548444Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:56.9564060Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:56.9585023Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:58.5673204Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:58.5673583Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:58.5817173Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:59.0309017Z Unpacking payload ...
2025-05-07T20:23:59.5471727Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:00.3442989Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:02.4528379Z Installing base environment...
2025-05-07T20:24:03.5321112Z Preparing transaction: ...working... done
2025-05-07T20:24:06.5147755Z Executing transaction: ...working... done
2025-05-07T20:24:07.1721311Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:07.2599645Z installation finished.
2025-05-07T20:24:07.2608627Z + rm -f miniconda.sh
2025-05-07T20:24:07.2914292Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:24:07.2914650Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:07.6566126Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:07.6566517Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:07.6566867Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:07.6567218Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:07.6567567Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:07.6567941Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:07.6568362Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:07.6568793Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:07.6569228Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:07.6570035Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:07.6570547Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:07.6570895Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:07.6571268Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:07.7220969Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:08.5590019Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:08.5612794Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:21.7783762Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:23.3569837Z Solving environment: done
2025-05-07T20:24:23.4521892Z ## Package Plan ##
2025-05-07T20:24:23.4522172Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:23.4522523Z added / updated specs:
2025-05-07T20:24:23.4522780Z - conda-libmamba-solver
2025-05-07T20:24:23.4523028Z - libarchive
2025-05-07T20:24:23.4523228Z - libmamba
2025-05-07T20:24:23.4523428Z - libmambapy
2025-05-07T20:24:23.4523690Z The following packages will be downloaded:
2025-05-07T20:24:23.4524014Z package | build
2025-05-07T20:24:23.4524328Z ---------------------------|-----------------
2025-05-07T20:24:23.4524729Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:23.4525397Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:23.4525808Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:23.4526277Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:23.4526722Z ------------------------------------------------------------
2025-05-07T20:24:23.4527045Z Total: 1.4 MB
2025-05-07T20:24:23.4527372Z The following packages will be UPDATED:
2025-05-07T20:24:23.4531167Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:23.4531930Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:23.4532514Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:23.4533160Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:23.4533934Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:23.4534564Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:23.7540761Z Preparing transaction: done
2025-05-07T20:24:23.8543397Z Verifying transaction: done
2025-05-07T20:24:25.1563299Z Executing transaction: done
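Note the --solver=classic flag in the command above: the classic solver is used once to bootstrap conda-libmamba-solver, after which conda can default to the faster libmamba solver. A minimal sketch of that pattern with standard conda commands (not a quote of setup_env.bash):

    # Sketch: bootstrap the libmamba solver, then make it the default.
    conda install -n base --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver
    conda config --set solver libmamba
    conda config --show solver   # should now report: solver: libmamba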
2025-05-07T20:24:26.9087648Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:26.9110730Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:27.8961551Z Channels:
2025-05-07T20:24:27.8961787Z - defaults
2025-05-07T20:24:27.8961991Z Platform: linux-64
2025-05-07T20:24:29.0906092Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.2089557Z Solving environment: done
2025-05-07T20:24:29.2089859Z Channels: - defaults
2025-05-07T20:24:29.2090063Z Platform: linux-64
2025-05-07T20:24:29.5172438Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.7277543Z Solving environment: done
2025-05-07T20:24:29.8795029Z ## Package Plan ##
2025-05-07T20:24:29.8795465Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:29.8796002Z added / updated specs:
2025-05-07T20:24:29.8796241Z - conda
2025-05-07T20:24:29.8796467Z The following packages will be downloaded:
2025-05-07T20:24:29.8796799Z package | build
2025-05-07T20:24:29.8797106Z ---------------------------|-----------------
2025-05-07T20:24:29.8797437Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:29.8797807Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:29.8798548Z ------------------------------------------------------------
2025-05-07T20:24:29.8798952Z Total: 1.4 MB
2025-05-07T20:24:29.8799255Z The following packages will be UPDATED:
2025-05-07T20:24:29.8799810Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:29.8800486Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:29.8801038Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:30.2678760Z Preparing transaction: done
2025-05-07T20:24:30.3684063Z Verifying transaction: done
2025-05-07T20:24:32.7714185Z Executing transaction: done
2025-05-07T20:24:33.3702550Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:33.3706266Z + conda clean --packages --tarball -y
2025-05-07T20:24:34.3811206Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:34.3811547Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:34.4441496Z + conda clean --all -y
2025-05-07T20:24:34.9819833Z There are no unused tarball(s) to remove.
2025-05-07T20:24:34.9820298Z Will remove 1 index cache(s).
2025-05-07T20:24:34.9820671Z There are no unused package(s) to remove.
2025-05-07T20:24:34.9821095Z There are no tempfile(s) to remove.
2025-05-07T20:24:34.9821503Z There are no logfile(s) to remove.
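The update plus the two clean passes keep the base environment current while reclaiming cache space on the runner disk. Reduced to a standalone sequence, the pattern is roughly:

    # Sketch: refresh base, then drop caches to keep runner disk usage small.
    conda update -n base -c defaults --update-deps -y conda
    conda clean --packages --tarball -y   # cached tarballs and unpacked packages
    conda clean --all -y                  # index caches, tempfiles, logfiles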
2025-05-07T20:24:35.0453293Z + conda info
2025-05-07T20:24:35.7911510Z active environment : base
2025-05-07T20:24:35.7911956Z active env location : /home/ec2-user/miniconda
2025-05-07T20:24:35.7912276Z shell level : 1
2025-05-07T20:24:35.7912573Z user config file : /home/ec2-user/.condarc
2025-05-07T20:24:35.7912948Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:35.7913290Z conda version : 25.3.1
2025-05-07T20:24:35.7913562Z conda-build version : not installed
2025-05-07T20:24:35.7913854Z python version : 3.13.2.final.0
2025-05-07T20:24:35.7914147Z solver : libmamba (default)
2025-05-07T20:24:35.7914444Z virtual packages : __archspec=1=zen2
2025-05-07T20:24:35.7914730Z __conda=25.3.1=0
2025-05-07T20:24:35.7914989Z __cuda=12.8=0
2025-05-07T20:24:35.7915252Z __glibc=2.34=0
2025-05-07T20:24:35.7915520Z __linux=6.1.130=0
2025-05-07T20:24:35.7915785Z __unix=0=0
2025-05-07T20:24:35.7916100Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:35.7916492Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:35.7917162Z conda av metadata url : None
2025-05-07T20:24:35.7917521Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:35.7917941Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:35.7918310Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:35.7918673Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:35.7919029Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:35.7919355Z /home/ec2-user/.conda/pkgs
2025-05-07T20:24:35.7919684Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:35.7919999Z /home/ec2-user/.conda/envs
2025-05-07T20:24:35.7920285Z platform : linux-64
2025-05-07T20:24:35.7921112Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:35.7921904Z UID:GID : 1000:1000
2025-05-07T20:24:35.7922160Z netrc file : None
2025-05-07T20:24:35.7922406Z offline mode : False
2025-05-07T20:24:35.8566105Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:35.8566872Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_ec600945-1e6b-443f-bc5e-7e18edd52288 ...
2025-05-07T20:24:35.8567646Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
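Condensed, the setup_miniconda step amounts to a standard non-interactive Miniconda install. A minimal reproduction, using the installer URL and paths shown in this log:

    # Sketch: non-interactive Miniconda bootstrap, as performed above.
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u   # -b batch mode, -p prefix, -u update in place
    rm -f miniconda.sh
    "$HOME/miniconda/bin/conda" init bash
    . "$HOME/.bashrc"   # pick up the conda shell hook in the current shell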
2025-05-07T20:24:35.8637099Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:35.8637589Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:35.8654473Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:35.8664054Z env:
2025-05-07T20:24:35.8664294Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:35.8664594Z BUILD_ENV: build_binary
2025-05-07T20:24:35.8664836Z BUILD_TARGET: genai
2025-05-07T20:24:35.8665066Z BUILD_VARIANT: cuda
2025-05-07T20:24:35.8665293Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:35.8665547Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:35.8665846Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:35.8666387Z ##[endgroup]
2025-05-07T20:24:36.2027927Z ################################################################################
2025-05-07T20:24:36.2028287Z # Create Conda Environment
2025-05-07T20:24:36.2028535Z #
2025-05-07T20:24:36.2044125Z # [2025-05-07T20:24:36.204Z] + create_conda_environment build_binary 3.13
2025-05-07T20:24:36.2044539Z ################################################################################
2025-05-07T20:24:36.2059377Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.2958564Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.2958931Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:36.2959245Z + conda info --envs
2025-05-07T20:24:37.0398959Z # conda environments:
2025-05-07T20:24:37.0399203Z #
2025-05-07T20:24:37.0399417Z base /home/ec2-user/miniconda
2025-05-07T20:24:37.1059133Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:38.7264636Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:38.7288162Z [SETUP] Creating new Conda environment (Python 3.13) ...
2025-05-07T20:24:38.7311793Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13
2025-05-07T20:24:39.4887287Z Channels:
2025-05-07T20:24:39.4887706Z - defaults
2025-05-07T20:24:39.4888111Z Platform: linux-64
2025-05-07T20:24:41.0400620Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.1645656Z Solving environment: done
2025-05-07T20:24:41.1933354Z ## Package Plan ##
2025-05-07T20:24:41.1934086Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.1934915Z added / updated specs:
2025-05-07T20:24:41.1935396Z - python=3.13
2025-05-07T20:24:41.1935904Z The following packages will be downloaded:
2025-05-07T20:24:41.1936543Z package | build
2025-05-07T20:24:41.1937167Z ---------------------------|-----------------
2025-05-07T20:24:41.1937863Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:41.1938620Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:41.1939418Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:41.1940501Z python_abi-3.13 | 0_cp313 6 KB
2025-05-07T20:24:41.1941204Z ------------------------------------------------------------
2025-05-07T20:24:41.1941833Z Total: 159 KB
2025-05-07T20:24:41.1942295Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.1942716Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:41.1943154Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:41.1943986Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:41.1944470Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:41.1944940Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:41.1945374Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:41.1945835Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:41.1946248Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.1946670Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:41.1947083Z libmpdec pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0
2025-05-07T20:24:41.1947775Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.1948216Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:41.1948628Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:41.1949029Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:41.1949419Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:41.1949825Z python pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313
2025-05-07T20:24:41.1950263Z python_abi pkgs/main/linux-64::python_abi-3.13-0_cp313
2025-05-07T20:24:41.1950673Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:41.1951139Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0
2025-05-07T20:24:41.1951589Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:41.1951962Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:41.1952332Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:41.1952737Z wheel pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0
2025-05-07T20:24:41.1953123Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:41.1953478Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:41.1953861Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:41.4849979Z Preparing transaction: done
2025-05-07T20:24:42.9096416Z Verifying transaction: done
2025-05-07T20:24:45.2245094Z Executing transaction: done
2025-05-07T20:24:45.2744201Z #
2025-05-07T20:24:45.2744626Z # To activate this environment, use
2025-05-07T20:24:45.2745129Z #
2025-05-07T20:24:45.2745477Z # $ conda activate build_binary
2025-05-07T20:24:45.2745923Z #
2025-05-07T20:24:45.2746294Z # To deactivate an active environment, use
2025-05-07T20:24:45.2746809Z #
2025-05-07T20:24:45.2747127Z # $ conda deactivate
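Later steps never activate this environment in the shell; they address it with conda run -n build_binary instead, which is the more robust pattern in non-interactive CI shells. A small usage sketch:

    # Sketch: create the build environment, then run tools in it without activation.
    conda create -y -n build_binary python=3.13
    conda run -n build_binary python --version   # Python 3.13.2, as reported later in this log
    conda run -n build_binary pip --version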
2025-05-07T20:24:45.3785508Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:45.3809720Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:48.3444635Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:48.3445256Z Collecting pip
2025-05-07T20:24:48.3445564Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:48.3445979Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:48.3448738Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 83.4 MB/s eta 0:00:00
2025-05-07T20:24:48.3449142Z Installing collected packages: pip
2025-05-07T20:24:48.3449431Z Attempting uninstall: pip
2025-05-07T20:24:48.3449709Z Found existing installation: pip 25.1
2025-05-07T20:24:48.3450023Z Uninstalling pip-25.1:
2025-05-07T20:24:48.3450297Z Successfully uninstalled pip-25.1
2025-05-07T20:24:48.3450603Z Successfully installed pip-25.1.1
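The [EXEC] [ATTEMPT 0/3] prefix throughout this log comes from a retry wrapper in the prelude script. The wrapper itself is not shown in the log; a hypothetical equivalent (the function name, backoff, and messages are illustrative only, not from setup_env.bash) could look like:

    # Hypothetical sketch of the retry pattern behind "[EXEC] [ATTEMPT n/3]".
    exec_with_retries () {
      local max=3 attempt
      for (( attempt = 0; attempt < max; attempt++ )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $(( 2 ** attempt ))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max} attempts: $*" >&2
      return 1
    }
    # Usage: exec_with_retries conda run -n build_binary pip install --upgrade pip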
2025-05-07T20:24:48.4073836Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:48.4096606Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:49.2629440Z Channels:
2025-05-07T20:24:49.2629685Z - conda-forge
2025-05-07T20:24:49.2629904Z Platform: linux-64
2025-05-07T20:24:59.4949187Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:01.1755166Z Solving environment: done
2025-05-07T20:25:01.2373613Z ## Package Plan ##
2025-05-07T20:25:01.2374013Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:01.2374413Z added / updated specs:
2025-05-07T20:25:01.2374688Z - pyopenssl[version='>22.1.0']
2025-05-07T20:25:01.2375007Z The following packages will be downloaded:
2025-05-07T20:25:01.2375338Z package | build
2025-05-07T20:25:01.2375653Z ---------------------------|-----------------
2025-05-07T20:25:01.2376008Z cffi-1.17.1 | py313hfab6e84_0 289 KB conda-forge
2025-05-07T20:25:01.2376449Z cryptography-44.0.3 | py313h6556f6e_0 1.5 MB conda-forge
2025-05-07T20:25:01.2376882Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:25:01.2377284Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:25:01.2377695Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:25:01.2378093Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:25:01.2378832Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:25:01.2379273Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:25:01.2379724Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:25:01.2380193Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:25:01.2380599Z ------------------------------------------------------------
2025-05-07T20:25:01.2380933Z Total: 6.4 MB
2025-05-07T20:25:01.2381265Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:01.2381824Z cffi conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0
2025-05-07T20:25:01.2382312Z cryptography conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0
2025-05-07T20:25:01.2382794Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:01.2385002Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:01.2385476Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:01.2385986Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:01.2386552Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:01.2386998Z The following packages will be UPDATED:
2025-05-07T20:25:01.2387698Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:01.2388458Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:01.2389107Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:01.2389725Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:01.2390270Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:25:01.7896570Z Preparing transaction: done
2025-05-07T20:25:01.8899364Z Verifying transaction: done
2025-05-07T20:25:03.3923534Z Executing transaction: done
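The import test that follows amounts to a one-line probe inside the environment; OpenSSL is the module name that the pyOpenSSL package installs, and printing the version is an optional extra rather than part of the original check:

    # Sketch: verify that pyOpenSSL is importable in the build environment.
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"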
2025-05-07T20:25:03.5665158Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:25:05.2651250Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:05.2664513Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:05.2687573Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:06.1279464Z Channels:
2025-05-07T20:25:06.1279755Z - conda-forge
2025-05-07T20:25:06.1279983Z Platform: linux-64
2025-05-07T20:25:09.3829552Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:09.7480806Z Solving environment: done
2025-05-07T20:25:09.8085780Z ## Package Plan ##
2025-05-07T20:25:09.8086337Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:09.8086875Z added / updated specs:
2025-05-07T20:25:09.8087188Z - libxcrypt
2025-05-07T20:25:09.8087510Z The following packages will be downloaded:
2025-05-07T20:25:09.8087917Z package | build
2025-05-07T20:25:09.8088224Z ---------------------------|-----------------
2025-05-07T20:25:09.8088592Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:25:09.8089381Z ------------------------------------------------------------
2025-05-07T20:25:09.8089711Z Total: 98 KB
2025-05-07T20:25:09.8090044Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:09.8090481Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:09.8090923Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:25:10.1132001Z Preparing transaction: done
2025-05-07T20:25:10.2136390Z Verifying transaction: done
2025-05-07T20:25:10.3143064Z Executing transaction: done
2025-05-07T20:25:13.7219009Z [SETUP] Copying over ...
2025-05-07T20:25:13.7219766Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h
2025-05-07T20:25:15.3584414Z [SETUP] Installed Python version: Python 3.13.2
2025-05-07T20:25:15.3584879Z [SETUP] Successfully created Conda environment: build_binary
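CPython 3.13 no longer ships the crypt module, and the step above works around that at the header level: it installs libxcrypt and grafts its crypt.h into the environment's Python include directory so builds that expect <crypt.h> next to the Python headers still compile. A minimal reproduction (the PREFIX variable is just shorthand for the env path shown in the log):

    # Sketch: provide crypt.h for builds against the Python 3.13 headers.
    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    cp "${PREFIX}/include/crypt.h" "${PREFIX}/include/python3.13/crypt.h"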
2025-05-07T20:25:15.3617145Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:15.3617620Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:15.3630110Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:15.3630453Z env:
2025-05-07T20:25:15.3630672Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:15.3630971Z BUILD_ENV: build_binary
2025-05-07T20:25:15.3631212Z BUILD_TARGET: genai
2025-05-07T20:25:15.3631438Z BUILD_VARIANT: cuda
2025-05-07T20:25:15.3631674Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:15.3641567Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:15.3641906Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:15.3642235Z ##[endgroup]
2025-05-07T20:25:15.7032457Z ################################################################################
2025-05-07T20:25:15.7032833Z # Install C/C++ Compilers
2025-05-07T20:25:15.7033070Z #
2025-05-07T20:25:15.7049712Z # [2025-05-07T20:25:15.704Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:15.7050148Z ################################################################################
2025-05-07T20:25:15.7066532Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:15.7956116Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:15.7966737Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:15.7990097Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:16.6639306Z Channels:
2025-05-07T20:25:16.6639552Z - conda-forge
2025-05-07T20:25:16.6639780Z Platform: linux-64
2025-05-07T20:25:19.9451069Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:20.3116376Z Solving environment: done
2025-05-07T20:25:20.3728608Z ## Package Plan ##
2025-05-07T20:25:20.3728997Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:20.3729414Z added / updated specs:
2025-05-07T20:25:20.3729683Z - sysroot_linux-64=2.17
2025-05-07T20:25:20.3729967Z The following packages will be downloaded:
2025-05-07T20:25:20.3730298Z package | build
2025-05-07T20:25:20.3730614Z ---------------------------|-----------------
2025-05-07T20:25:20.3731035Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:25:20.3731503Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:25:20.3731907Z ------------------------------------------------------------
2025-05-07T20:25:20.3732237Z Total: 15.4 MB
2025-05-07T20:25:20.3732567Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:20.3733066Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:20.3733926Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
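The pin in the plan above is the important part: sysroot_linux-64=2.17 makes the toolchain compile against glibc 2.17 headers, the compatibility baseline also used by manylinux2014-style wheels, so the resulting binaries stay loadable on older distributions. Reduced to one line:

    # Sketch: pin the conda sysroot to the glibc 2.17 compatibility baseline.
    conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17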
2025-05-07T20:25:20.3734382Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:25:21.4703827Z Preparing transaction: done
2025-05-07T20:25:21.6711668Z Verifying transaction: done
2025-05-07T20:25:21.8819831Z Executing transaction: done
2025-05-07T20:25:22.0337280Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:22.0337633Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:23.7083386Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:23.7101185Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:23.7124302Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:24.6057974Z Channels:
2025-05-07T20:25:24.6058304Z - conda-forge
2025-05-07T20:25:24.6058598Z Platform: linux-64
2025-05-07T20:25:27.8836719Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:28.8284428Z Solving environment: done
2025-05-07T20:25:28.8911151Z ## Package Plan ##
2025-05-07T20:25:28.8911698Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:28.8912096Z added / updated specs:
2025-05-07T20:25:28.8912367Z - gxx_linux-64=11.4.0
2025-05-07T20:25:28.8912655Z The following packages will be downloaded:
2025-05-07T20:25:28.8912982Z package | build
2025-05-07T20:25:28.8913287Z ---------------------------|-----------------
2025-05-07T20:25:28.8913674Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:28.8914146Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:28.8914590Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:28.8915019Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:28.8915433Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:28.8915857Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:28.8916274Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:28.8916723Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:28.8917507Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:28.8917929Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:28.8918395Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:28.8918863Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:25:28.8919251Z ------------------------------------------------------------
2025-05-07T20:25:28.8919594Z Total: 91.6 MB
2025-05-07T20:25:28.8919921Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:28.8920563Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:28.8921125Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:28.8921653Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:28.8922143Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:28.8922628Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:28.8923115Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:28.8923623Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:28.8924179Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:28.8924659Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:28.8925203Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:28.8925669Z The following packages will be UPDATED:
2025-05-07T20:25:28.8926181Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:28.8926898Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:28.8927446Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:29.2396723Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:29.3414873Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:29.5238317Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:29.5306437Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:29.5927858Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:29.6092260Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:29.6283416Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:29.6623708Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:29.6690363Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:29.7193114Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:29.7286234Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:30.0340853Z gcc_impl_linux-64-11 | 53.0 MB | #######9 | 79%
2025-05-07T20:25:30.1144790Z 2025-05-07T20:25:30.1144794Z 2025-05-07T20:25:30.1144798Z 2025-05-07T20:25:30.1144801Z 2025-05-07T20:25:30.1144805Z 2025-05-07T20:25:30.1144808Z 2025-05-07T20:25:30.1144812Z 2025-05-07T20:25:30.1144816Z 2025-05-07T20:25:30.1173568Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:30.1173954Z 2025-05-07T20:25:30.1173958Z 2025-05-07T20:25:30.1173962Z 2025-05-07T20:25:30.1173966Z 2025-05-07T20:25:30.1173969Z 2025-05-07T20:25:30.1173973Z 2025-05-07T20:25:30.1173977Z 2025-05-07T20:25:30.1173980Z 2025-05-07T20:25:30.1173984Z 2025-05-07T20:25:30.1173988Z 2025-05-07T20:25:30.1173991Z 2025-05-07T20:25:30.1179090Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:30.1179509Z 2025-05-07T20:25:30.1179515Z 2025-05-07T20:25:30.1179521Z 2025-05-07T20:25:30.1179526Z 2025-05-07T20:25:30.1179532Z 2025-05-07T20:25:30.1179537Z 2025-05-07T20:25:30.1179543Z 2025-05-07T20:25:30.1179561Z 2025-05-07T20:25:30.1179566Z 2025-05-07T20:25:30.1179571Z 2025-05-07T20:25:30.1179577Z 2025-05-07T20:25:30.1329251Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:30.2922417Z gcc_impl_linux-64-11 | 53.0 MB | ########9 | 90% 2025-05-07T20:25:30.2922774Z 2025-05-07T20:25:30.2922780Z 2025-05-07T20:25:30.2924030Z 2025-05-07T20:25:30.4077518Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:30.4595490Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:30.4595758Z 2025-05-07T20:25:30.7035317Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:30.7035594Z 2025-05-07T20:25:30.7035599Z 2025-05-07T20:25:31.1097162Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:31.1104047Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:31.1104508Z 2025-05-07T20:25:31.1104765Z 2025-05-07T20:25:31.1104975Z  2025-05-07T20:25:31.1105197Z 2025-05-07T20:25:31.1105202Z 2025-05-07T20:25:31.1105365Z  2025-05-07T20:25:31.1105578Z 2025-05-07T20:25:31.1105583Z 2025-05-07T20:25:31.1105587Z 2025-05-07T20:25:31.1105751Z  2025-05-07T20:25:31.1105956Z 2025-05-07T20:25:31.1105960Z 2025-05-07T20:25:31.1105965Z 2025-05-07T20:25:31.1105982Z 2025-05-07T20:25:31.1106221Z  2025-05-07T20:25:31.1106528Z 2025-05-07T20:25:31.1106534Z 2025-05-07T20:25:31.1106551Z 2025-05-07T20:25:31.1106557Z 2025-05-07T20:25:31.1106562Z 2025-05-07T20:25:31.1106833Z  2025-05-07T20:25:31.1107247Z 2025-05-07T20:25:31.1107251Z 2025-05-07T20:25:31.1107254Z 2025-05-07T20:25:31.1107258Z 2025-05-07T20:25:31.1107269Z 2025-05-07T20:25:31.1107273Z 2025-05-07T20:25:31.1107455Z  2025-05-07T20:25:31.1107748Z 2025-05-07T20:25:31.1107752Z 2025-05-07T20:25:31.1107755Z 2025-05-07T20:25:31.1107759Z 2025-05-07T20:25:31.1107763Z 2025-05-07T20:25:31.1107774Z 2025-05-07T20:25:31.1107777Z 2025-05-07T20:25:31.1107952Z  2025-05-07T20:25:31.1108159Z 2025-05-07T20:25:31.1108163Z 2025-05-07T20:25:31.1108167Z 2025-05-07T20:25:31.1108170Z 2025-05-07T20:25:31.1108174Z 2025-05-07T20:25:31.1108184Z 2025-05-07T20:25:31.1108188Z 2025-05-07T20:25:31.1108192Z 2025-05-07T20:25:31.1108498Z  2025-05-07T20:25:31.1108718Z 2025-05-07T20:25:31.1108722Z 2025-05-07T20:25:31.1108726Z 2025-05-07T20:25:31.1108735Z 2025-05-07T20:25:31.1108739Z 2025-05-07T20:25:31.1108743Z 2025-05-07T20:25:31.1108746Z 2025-05-07T20:25:31.1108750Z 2025-05-07T20:25:31.1108754Z 2025-05-07T20:25:31.1108934Z  2025-05-07T20:25:31.1109142Z 2025-05-07T20:25:31.1109151Z 2025-05-07T20:25:31.1109155Z 2025-05-07T20:25:31.1109159Z 2025-05-07T20:25:31.1109162Z 2025-05-07T20:25:31.1109166Z 
2025-05-07T20:25:31.1109170Z 2025-05-07T20:25:31.1109173Z 2025-05-07T20:25:31.1109177Z 2025-05-07T20:25:31.1109181Z 2025-05-07T20:25:31.1109364Z  2025-05-07T20:25:31.1109589Z 2025-05-07T20:25:31.1109593Z 2025-05-07T20:25:31.1109596Z 2025-05-07T20:25:31.1109600Z 2025-05-07T20:25:31.1109610Z 2025-05-07T20:25:31.1109614Z 2025-05-07T20:25:31.1109617Z 2025-05-07T20:25:31.1109621Z 2025-05-07T20:25:31.1109629Z 2025-05-07T20:25:31.1109633Z 2025-05-07T20:25:31.1109637Z 2025-05-07T20:25:31.1109843Z  done 2025-05-07T20:25:31.2114737Z Preparing transaction: \ done 2025-05-07T20:25:31.5121403Z Verifying transaction: / - \ done 2025-05-07T20:25:31.6131283Z Executing transaction: / done 2025-05-07T20:25:31.7772925Z [INSTALL] Setting the C/C++ compiler symlinks ... 2025-05-07T20:25:35.6459597Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:35.6460154Z 2025-05-07T20:25:35.6475198Z 2025-05-07T20:25:35.6493357Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:35.6493893Z 2025-05-07T20:25:35.6507001Z 2025-05-07T20:25:35.6525614Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:35.6526147Z 2025-05-07T20:25:35.6538853Z 2025-05-07T20:25:35.6556804Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:35.6557318Z 2025-05-07T20:25:35.6571284Z 2025-05-07T20:25:37.5374379Z /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:37.5374784Z 2025-05-07T20:25:37.5998249Z [CHECK] Binary cc found in PATH 2025-05-07T20:25:39.4792419Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:39.4792748Z 2025-05-07T20:25:39.5414587Z [CHECK] Binary gcc found in PATH 2025-05-07T20:25:41.4122328Z /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:41.4122631Z 2025-05-07T20:25:41.4733749Z [CHECK] Binary c++ found in PATH 2025-05-07T20:25:43.3453152Z /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:43.3453425Z 2025-05-07T20:25:43.4086091Z [CHECK] Binary g++ found in PATH 2025-05-07T20:25:43.4089952Z [INFO] Printing out all preprocessor defines in the C compiler ... 
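[NOTE] A hedged aside on the command that follows: with a GCC-compatible driver, `-E` stops after preprocessing, `-dM` dumps every macro definition instead of the preprocessed source, and `-` reads the translation unit from stdin, so an empty stdin yields exactly the compiler's predefined macros. A minimal local reproduction sketch, assuming only that `cc` resolves to the conda-provided GCC set up above:
  echo | cc -dM -E -                    # dump all predefined C macros
  echo | cc -dM -E - | grep __VERSION   # e.g. isolate the compiler version macro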
2025-05-07T20:25:43.4090435Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:43.4090640Z 2025-05-07T20:25:45.2937354Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:45.2937871Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:45.2938271Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:45.2938563Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:45.2938901Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:45.2939385Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:45.2939775Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:45.2940395Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:45.2940769Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:45.2941159Z #define __CHAR_BIT__ 8 2025-05-07T20:25:45.2941471Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:45.2942150Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:45.2942521Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:45.2942907Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:45.2943272Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:45.2943679Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.2943975Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:45.2944344Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:45.2944775Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:45.2945195Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:45.2945593Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:45.2945997Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:45.2946296Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:45.2946559Z #define __GCC_IEC_559 2 2025-05-07T20:25:45.2946794Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:45.2947061Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:45.2947318Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:45.2947594Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:45.2948001Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.2948308Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:45.2948570Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:45.2948836Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:45.2949088Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:45.2949348Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:45.2949599Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:45.2949846Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:45.2950097Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:45.2950337Z #define __INT8_C(c) c 2025-05-07T20:25:45.2950569Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:45.2950850Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.2951156Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:45.2951460Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:45.2951802Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:45.2952070Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:45.2952334Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.2952597Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:45.2952866Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:45.2953250Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:45.2953655Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:45.2953928Z #define __linux 1 2025-05-07T20:25:45.2954150Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:45.2954422Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:45.2954687Z #define __unix 1 2025-05-07T20:25:45.2954906Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:45.2955175Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:45.2955435Z #define __WINT_MIN__ 0U 2025-05-07T20:25:45.2955672Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:45.2955949Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:45.2956207Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:45.2956466Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:45.2956893Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:45.2957162Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:45.2957445Z #define __INT64_C(c) c ## L 2025-05-07T20:25:45.2957701Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:45.2957982Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:45.2958239Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:45.2958581Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:45.2958954Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:45.2959191Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:45.2959443Z #define __DBL_DIG__ 15 2025-05-07T20:25:45.2959663Z #define __FLT32_DIG__ 6 2025-05-07T20:25:45.2959947Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:45.2960284Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:45.2960613Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:45.2960924Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:45.2961259Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:45.2961496Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:45.2961741Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:45.2962109Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:45.2962489Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:45.2962766Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:45.2963013Z #define __unix__ 1 2025-05-07T20:25:45.2963220Z #define __INT_WIDTH__ 32 2025-05-07T20:25:45.2963455Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:45.2963690Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:45.2963925Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:45.2964179Z #define __UINT16_C(c) c 2025-05-07T20:25:45.2964407Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:45.2964646Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:45.2964994Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:45.2965340Z #define __gnu_linux__ 1 2025-05-07T20:25:45.2965574Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:45.2965843Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:45.2966122Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.2966374Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:45.2966624Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:45.2976049Z #define __GNUC__ 11 2025-05-07T20:25:45.2976275Z #define __pie__ 2 2025-05-07T20:25:45.2976482Z #define __MMX__ 1 2025-05-07T20:25:45.2976702Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:45.2976963Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:45.2977236Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:45.2977505Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:45.2977843Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:45.2978235Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.2978553Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:45.2978805Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:45.2979064Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:45.2979350Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:45.2979609Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:45.2979860Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:45.2980126Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:45.2980410Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:45.2980671Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:45.2980933Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:45.2981177Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:45.2981436Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:45.2981689Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:45.2981948Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:45.2982197Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:45.2982504Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:45.2982852Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:45.2983116Z #define __SSE2_MATH__ 1 2025-05-07T20:25:45.2983359Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:45.2983763Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.2984048Z #define __amd64 1 2025-05-07T20:25:45.2984267Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:45.2984519Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:45.2984813Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:45.2985115Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:45.2985358Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:45.2985624Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:45.2985870Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:45.2986117Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:45.2986367Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:45.2986608Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:45.2986860Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:45.2987118Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:45.2987453Z #define __x86_64 1 2025-05-07T20:25:45.2987768Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:45.2988142Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:45.2988592Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:45.2989039Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:45.2989498Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:45.2989869Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:45.2990112Z #define __LP64__ 1 2025-05-07T20:25:45.2990326Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.2990663Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:45.2991027Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:45.2991295Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:45.2991552Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:45.2991822Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:45.2992089Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:45.2992339Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:45.2992591Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:45.2992839Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:45.2993081Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:45.2993399Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:45.2993748Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:45.2994009Z #define __FLT_DIG__ 6 2025-05-07T20:25:45.2994231Z #define __NO_INLINE__ 1 2025-05-07T20:25:45.2994464Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:45.2994779Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:45.2995111Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:45.2995435Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:45.2995768Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:45.2996009Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:45.2996258Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:45.2996515Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:45.2996798Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:45.2997076Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:45.2997332Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:45.2997616Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:45.2997936Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:45.2998187Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:45.2998435Z #define __FLT128_DIG__ 33 2025-05-07T20:25:45.2998665Z #define __INT32_C(c) c 2025-05-07T20:25:45.2998897Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:45.2999161Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:45.2999420Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:45.2999690Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:45.2999991Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:45.3000277Z #define unix 1 2025-05-07T20:25:45.3000495Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:45.3000799Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3001087Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:45.3001486Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:45.3001797Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:45.3002029Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:45.3002277Z #define __ELF__ 1 2025-05-07T20:25:45.3002497Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:45.3002760Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:45.3003022Z #define __FLT_RADIX__ 2 2025-05-07T20:25:45.3003258Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:45.3003604Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:45.3003947Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:45.3004190Z #define __SSE_MATH__ 1 2025-05-07T20:25:45.3004403Z #define __k8 1 2025-05-07T20:25:45.3004683Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:45.3005040Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:45.3005417Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:45.3005705Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:45.3005958Z #define __LDBL_DIG__ 18 2025-05-07T20:25:45.3006249Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:45.3006581Z #define __x86_64__ 1 2025-05-07T20:25:45.3006852Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:45.3007173Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:45.3007507Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3007800Z #define __FLT64_DIG__ 15 2025-05-07T20:25:45.3008070Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3008403Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:45.3008699Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3008956Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:45.3009222Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3009504Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:45.3009864Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:45.3010250Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:45.3010527Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:45.3010849Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:45.3011154Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:45.3011432Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:45.3011702Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:45.3011991Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:45.3012263Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:45.3012487Z #define __SEG_FS 1 2025-05-07T20:25:45.3012710Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:45.3012975Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:45.3013231Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3013506Z #define __SEG_GS 1 2025-05-07T20:25:45.3013812Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:45.3014173Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:45.3014435Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:45.3014709Z #define __INT16_TYPE__ short int 2025-05-07T20:25:45.3014972Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:45.3015252Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:45.3015508Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:45.3015738Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:45.3015984Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:45.3016311Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:45.3016695Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3016960Z #define linux 1 2025-05-07T20:25:45.3017172Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3017435Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:45.3017689Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:45.3017929Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:45.3018172Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:45.3018413Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:45.3018754Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:45.3019290Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:45.3019602Z #define __code_model_small__ 1 2025-05-07T20:25:45.3019865Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:45.3020137Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:45.3020372Z #define __k8__ 1 2025-05-07T20:25:45.3020586Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:45.3020859Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:45.3021140Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:45.3021365Z #define __pic__ 2 2025-05-07T20:25:45.3021604Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3021900Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:45.3022175Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3022488Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:45.3022841Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:45.3023266Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:45.3023531Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:45.3023815Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:45.3024106Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:45.3024335Z #define __linux__ 1 2025-05-07T20:25:45.3024544Z #define __INT64_TYPE__ long int 2025-05-07T20:25:45.3024791Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:45.3025030Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:45.3025282Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:45.3025521Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:45.3025792Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3026101Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:45.3026376Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:45.3026621Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:45.3026896Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:45.3027172Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:45.3027487Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:45.3027941Z #define __SSE__ 1 2025-05-07T20:25:45.3028166Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:45.3028495Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:45.3028819Z #define __amd64__ 1 2025-05-07T20:25:45.3029033Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:45.3029276Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:45.3029527Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:45.3029786Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:45.3030037Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:45.3030296Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:45.3030544Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:45.3030805Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:45.3031053Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:45.3031384Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:45.3031839Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:45.3032191Z #define _LP64 1 2025-05-07T20:25:45.3032395Z #define __UINT8_C(c) c 2025-05-07T20:25:45.3032623Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:45.3032887Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:45.3033138Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:45.3033394Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:45.3033679Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:45.3034015Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:45.3034461Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:45.3034819Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3035094Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3035393Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:45.3035751Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:45.3036107Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:45.3036353Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:45.3036678Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:45.3037156Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:45.3037424Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:45.3037665Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:45.3037912Z #define __FXSR__ 1 2025-05-07T20:25:45.3038200Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:45.3038644Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:45.3039040Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:45.3039335Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:45.3039574Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:45.3039895Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:45.3040506Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:45.3040737Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:45.3041133Z #define __PIC__ 2 2025-05-07T20:25:45.3041411Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:45.3041860Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:45.3042296Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:45.3042663Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:45.3043026Z #define __SSE2__ 1 2025-05-07T20:25:45.3043260Z #define __INT32_TYPE__ int 2025-05-07T20:25:45.3043529Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:45.3043796Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:45.3044121Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:45.3044484Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:45.3044745Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:45.3045002Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:45.3045261Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3045527Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:45.3045763Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:45.3046001Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:45.3046276Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3046560Z #define __PIE__ 2 2025-05-07T20:25:45.3046875Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:45.3047266Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:45.3047595Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:45.3047938Z #define __INT16_C(c) c 2025-05-07T20:25:45.3048177Z #define __STDC__ 1 2025-05-07T20:25:45.3048405Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:45.3048665Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:45.3048904Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:45.3049186Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:45.3049523Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:45.3049833Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:45.3050092Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:45.3050360Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:45.3050609Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:45.3050890Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:45.3051172Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3051431Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:45.3051708Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3052087Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:45.3052445Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:45.3052729Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:45.3053007Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:45.3053249Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:45.3053398Z 2025-05-07T20:25:45.3554809Z 2025-05-07T20:25:45.3555114Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
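[NOTE] A hedged aside on the C++ invocation that follows: stdin (`-`) carries no file extension, so `-x c++` is needed to force the C++ front end. The resulting dump extends the C list with dialect and feature-test macros such as `__cplusplus 201703L` and `__cpp_if_constexpr 201606L`, both visible below. A minimal sketch for inspecting them locally, assuming the same conda toolchain:
  echo | c++ -dM -E -x c++ - | grep -E '__cplusplus|__cpp_'   # list C++ dialect and feature-test macros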
2025-05-07T20:25:45.3555591Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:45.3555885Z 2025-05-07T20:25:47.2426374Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:47.2426837Z #define __cpp_attributes 200809L 2025-05-07T20:25:47.2427527Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:47.2428027Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:47.2428309Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:47.2428564Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:47.2428908Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:47.2429261Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:47.2429534Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:47.2429838Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:47.2430131Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:47.2430395Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:47.2430644Z #define __CHAR_BIT__ 8 2025-05-07T20:25:47.2430868Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:47.2431116Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:47.2431365Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:47.2431803Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:47.2432074Z #define __cpp_static_assert 201411L 2025-05-07T20:25:47.2432362Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:47.2432656Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2432942Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:47.2433222Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:47.2433538Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:47.2433842Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:47.2434234Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:47.2434642Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:47.2434941Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:47.2435215Z #define __GCC_IEC_559 2 2025-05-07T20:25:47.2435452Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:47.2435719Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:47.2435993Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:47.2436273Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:47.2436557Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:47.2436862Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:47.2437164Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:47.2437485Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2437791Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:47.2438057Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:47.2438329Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:47.2438594Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:47.2438895Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:47.2439154Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:47.2439403Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:47.2439673Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:47.2439997Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:47.2440650Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:47.2440905Z #define __INT8_C(c) c 2025-05-07T20:25:47.2441136Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:47.2441411Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:47.2441719Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2442035Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:47.2442304Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:47.2442581Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:47.2442899Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:47.2443242Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:47.2443513Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:47.2443781Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:47.2444042Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2444305Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:47.2444584Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:47.2444967Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:47.2445377Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:47.2445652Z #define __linux 1 2025-05-07T20:25:47.2446021Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:47.2446295Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:47.2446557Z #define __unix 1 2025-05-07T20:25:47.2446777Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:47.2447051Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:47.2447327Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:47.2447618Z #define __WINT_MIN__ 0U 2025-05-07T20:25:47.2447876Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:47.2448161Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:47.2448429Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:47.2448687Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:47.2448928Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:47.2449210Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:47.2449496Z #define __INT64_C(c) c ## L 2025-05-07T20:25:47.2449874Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:47.2450166Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:47.2450430Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:47.2450724Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:47.2451000Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:47.2451256Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:47.2451594Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:47.2451961Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:47.2452209Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:47.2452478Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:47.2452741Z #define __DBL_DIG__ 15 2025-05-07T20:25:47.2452965Z #define __FLT32_DIG__ 6 2025-05-07T20:25:47.2453259Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:47.2453589Z #define __GXX_WEAK__ 1 2025-05-07T20:25:47.2453818Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:47.2454060Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:47.2463190Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:47.2463579Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:47.2463859Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:47.2464168Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:47.2464497Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:47.2464901Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:47.2465300Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:47.2465580Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:47.2465833Z #define __unix__ 1 2025-05-07T20:25:47.2466059Z #define __INT_WIDTH__ 32 2025-05-07T20:25:47.2466299Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:47.2466544Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:47.2466786Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:47.2467047Z #define __UINT16_C(c) c 2025-05-07T20:25:47.2467285Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:47.2467531Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:47.2468004Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:47.2468370Z #define __gnu_linux__ 1 2025-05-07T20:25:47.2468609Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:47.2468863Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:47.2469142Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:47.2469423Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2469681Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:47.2469937Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:47.2470182Z #define __GNUC__ 11 2025-05-07T20:25:47.2470391Z #define __GXX_RTTI 1 2025-05-07T20:25:47.2470611Z #define __pie__ 2 2025-05-07T20:25:47.2470820Z #define __MMX__ 1 2025-05-07T20:25:47.2471031Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:47.2471292Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:47.2471564Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:47.2471818Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:47.2472068Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:47.2472365Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:47.2472672Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:47.2473131Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:47.2473493Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:47.2473790Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2474088Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:47.2474338Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:47.2474591Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:47.2474886Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:47.2475170Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:47.2475425Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:47.2475673Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:47.2475946Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:47.2476228Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:47.2476482Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:47.2476844Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:47.2477093Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:47.2477355Z #define __cplusplus 201703L 2025-05-07T20:25:47.2477645Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:47.2477939Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:47.2478190Z #define __DEPRECATED 1 2025-05-07T20:25:47.2478430Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:47.2478717Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:47.2478966Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:47.2479269Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:47.2479615Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:47.2479875Z #define __SSE2_MATH__ 1 2025-05-07T20:25:47.2480112Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:47.2480404Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2480688Z #define __amd64 1 2025-05-07T20:25:47.2480905Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:47.2481167Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:47.2481435Z #define __GNUG__ 11 2025-05-07T20:25:47.2481685Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:47.2481987Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:47.2482232Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:47.2482482Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:47.2482743Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:47.2482990Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:47.2483258Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:47.2483537Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:47.2483799Z #define __cpp_hex_float 201603L 2025-05-07T20:25:47.2484058Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:47.2484309Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:47.2484582Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:47.2484845Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:47.2485098Z #define __x86_64 1 2025-05-07T20:25:47.2485328Z #define __cpp_lambdas 200907L 2025-05-07T20:25:47.2485587Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:47.2485951Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:47.2486336Z #define __cpp_template_auto 201606L 2025-05-07T20:25:47.2486688Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:47.2487136Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:47.2487598Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:47.2487970Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:47.2488214Z #define __LP64__ 1 2025-05-07T20:25:47.2488430Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2488768Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:47.2489134Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:47.2489398Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:47.2489672Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:47.2489937Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:47.2490200Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:47.2490446Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:47.2490792Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:47.2491110Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:47.2491450Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:47.2491719Z #define __FLT_DIG__ 6 2025-05-07T20:25:47.2491945Z #define __NO_INLINE__ 1 2025-05-07T20:25:47.2492172Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:47.2492533Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:47.2492865Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:47.2493117Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:47.2493374Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:47.2493618Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:47.2493890Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:47.2494179Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:47.2494425Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:47.2494855Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:47.2495136Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:47.2495398Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:47.2495693Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:47.2496023Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:47.2496310Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:47.2496561Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:47.2496814Z #define __FLT128_DIG__ 33 2025-05-07T20:25:47.2497051Z #define __INT32_C(c) c 2025-05-07T20:25:47.2497281Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:47.2497560Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:47.2497838Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:47.2498103Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:47.2498410Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:47.2498709Z #define unix 1 2025-05-07T20:25:47.2498918Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:47.2499181Z #define __cpp_rtti 199711L 2025-05-07T20:25:47.2499440Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:47.2499739Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2500045Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:47.2500348Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:47.2500669Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:47.2500912Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:47.2501196Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:47.2501470Z #define __ELF__ 1 2025-05-07T20:25:47.2501687Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:47.2501964Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:47.2502232Z #define __FLT_RADIX__ 2 2025-05-07T20:25:47.2502468Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:47.2502814Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:47.2503166Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:47.2503427Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:47.2503702Z #define __k8 1 2025-05-07T20:25:47.2503994Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:47.2504359Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:47.2504644Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:47.2504936Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:47.2505192Z #define __LDBL_DIG__ 18 2025-05-07T20:25:47.2505424Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:47.2505675Z #define __x86_64__ 1 2025-05-07T20:25:47.2505908Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:47.2506195Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:47.2506522Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2506821Z #define __FLT64_DIG__ 15 2025-05-07T20:25:47.2507092Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2507429Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:47.2507837Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2508103Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:47.2508379Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2508671Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:47.2509123Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:47.2509507Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:47.2509791Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:47.2510106Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:47.2510410Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:47.2510723Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:47.2511010Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:47.2511272Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:47.2511561Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:47.2511830Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:47.2512056Z #define __SEG_FS 1 2025-05-07T20:25:47.2512274Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:47.2512534Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:47.2512878Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2513149Z #define __SEG_GS 1 2025-05-07T20:25:47.2513456Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:47.2513822Z #define __SIG_ATOMIC_WIDTH__ 32
2025-05-07T20:25:47.2514077Z ... (more hidden) ...
2025-05-07T20:25:47.2559545Z #define __ATOMIC_RELEASE 3
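The dump above is the compiler preprocessor's -dM output, which prints every predefined macro instead of preprocessed source. A minimal sketch of how to filter such a dump for specific macros, using the same conda run pattern as the checks below (the particular macro names grepped here are illustrative choices, not part of the original workflow):

  # Dump all predefined C++ macros from the build_binary env and keep a few.
  # -dM prints the #define lines; "-x c++ -" reads an empty C++ program from stdin.
  conda run -n build_binary c++ -dM -E -x c++ - < /dev/null \
    | grep -E '__cplusplus|__GNUC|__cpp_constexpr' \
    | sort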
2025-05-07T20:25:47.3043750Z + conda run -n build_binary c++ --version
2025-05-07T20:25:49.1760063Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:49.1760465Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:49.1760909Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:49.1761436Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:49.2386421Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:49.2388001Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:51.1879556Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:51.1882561Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:51.1883134Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:53.1323649Z #define __cplusplus 201703L
2025-05-07T20:25:53.1326524Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:53.1374930Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:53.1375336Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:53.1388998Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:53.1389350Z env:
2025-05-07T20:25:53.1389570Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:53.1389876Z   BUILD_ENV: build_binary
2025-05-07T20:25:53.1390123Z   BUILD_TARGET: genai
2025-05-07T20:25:53.1390356Z   BUILD_VARIANT: cuda
2025-05-07T20:25:53.1390581Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:53.1390835Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:53.1391131Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:53.1391464Z ##[endgroup]
2025-05-07T20:25:53.4747276Z ################################################################################
2025-05-07T20:25:53.4747850Z # Install CUDA
2025-05-07T20:25:53.4748121Z #
2025-05-07T20:25:53.4763647Z # [2025-05-07T20:25:53.476Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:53.4764163Z ################################################################################
2025-05-07T20:25:53.4779216Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:53.5685555Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:53.5686037Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:53.5690174Z + conda clean --packages --tarball -y
2025-05-07T20:25:54.2702101Z Will remove 29 (113.6 MB) tarball(s).
2025-05-07T20:25:54.2702627Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:54.3332948Z + conda clean --all -y
2025-05-07T20:25:54.9934532Z There are no unused tarball(s) to remove.
2025-05-07T20:25:54.9935162Z Will remove 1 index cache(s).
2025-05-07T20:25:54.9935721Z There are no unused package(s) to remove.
2025-05-07T20:25:54.9936322Z There are no tempfile(s) to remove.
2025-05-07T20:25:54.9936891Z There are no logfile(s) to remove.
2025-05-07T20:25:55.0572122Z [INSTALL] Installing CUDA 12.8.0 ...
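The [EXEC] [ATTEMPT 0/3] prefix on the wget probe above (and on the conda install below) shows that the prelude routes commands through a retry helper defined in .github/scripts/setup_env.bash. That helper is not shown in this log; a minimal sketch of the pattern, with a hypothetical name and a simple exponential backoff, assuming up to three retries as the log suggests:

  # Hypothetical retry wrapper mirroring the [EXEC] [ATTEMPT i/N] log lines;
  # the real implementation lives in the prelude script and may differ.
  exec_with_retries () {
    local max_retries=3
    for ((i = 0; i <= max_retries; i++)); do
      echo "[EXEC] [ATTEMPT ${i}/${max_retries}] + $*"
      "$@" && return 0        # first success wins
      sleep $(( 2 ** i ))     # back off before retrying
    done
    echo "[ERROR] Command failed after ${max_retries} retries: $*"
    return 1
  }

  exec_with_retries conda install --force-reinstall -n build_binary \
    -c conda-forge --override-channels -y cuda=12.8.0

Note that --override-channels restricts the solve to conda-forge alone, which is why several pkgs/main packages end up SUPERSEDED by conda-forge builds in the plan below.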
2025-05-07T20:25:55.0596566Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
2025-05-07T20:25:55.9645870Z Channels:
2025-05-07T20:26:06.3348880Z  - conda-forge
2025-05-07T20:26:06.3349166Z Platform: linux-64
2025-05-07T20:26:06.3349908Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:07.4486670Z Solving environment: done
2025-05-07T20:26:07.5217217Z ## Package Plan ##
2025-05-07T20:26:07.5217623Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:07.5218015Z   added / updated specs:
2025-05-07T20:26:07.5218245Z     - cuda=12.8.0
2025-05-07T20:26:07.5218525Z The following packages will be downloaded:
2025-05-07T20:26:07.5218847Z     package                             |            build
2025-05-07T20:26:07.5219160Z     ------------------------------------|-----------------
    alsa-lib-1.2.14                     | hb9d3cd8_0        553 KB  conda-forge
    attr-2.5.1                          | h166bdaf_1         69 KB  conda-forge
    binutils-2.40                       | h4852527_7         31 KB  conda-forge
    c-compiler-1.5.2                    | h0b41bf4_0          6 KB  conda-forge
    cuda-12.8.0                         | ha804496_0         26 KB  conda-forge
    cuda-cccl_linux-64-12.8.55          | ha770c72_1        1.0 MB  conda-forge
    cuda-command-line-tools-12.8.0      | ha770c72_0         20 KB  conda-forge
    cuda-compiler-12.8.0                | hbad6d8a_0         20 KB  conda-forge
    cuda-crt-dev_linux-64-12.8.61       | ha770c72_1         90 KB  conda-forge
    cuda-crt-tools-12.8.61              | ha770c72_1         27 KB  conda-forge
    cuda-cudart-12.8.57                 | h5888daf_1         22 KB  conda-forge
    cuda-cudart-dev-12.8.57             | h5888daf_1         23 KB  conda-forge
    cuda-cudart-dev_linux-64-12.8.57    | h3f2d84a_1        377 KB  conda-forge
    cuda-cudart-static-12.8.57          | h5888daf_1         22 KB  conda-forge
    cuda-cudart-static_linux-64-12.8.57 | h3f2d84a_1        950 KB  conda-forge
    cuda-cudart_linux-64-12.8.57        | h3f2d84a_1        188 KB  conda-forge
    cuda-cuobjdump-12.8.55              | hbd13f7d_0        227 KB  conda-forge
    cuda-cupti-12.8.57                  | hbd13f7d_0        1.8 MB  conda-forge
    cuda-cupti-dev-12.8.57              | h5888daf_0        4.0 MB  conda-forge
    cuda-cuxxfilt-12.8.55               | hbd13f7d_0        211 KB  conda-forge
    cuda-driver-dev-12.8.57             | h5888daf_1         22 KB  conda-forge
    cuda-driver-dev_linux-64-12.8.90    | h3f2d84a_1         36 KB  conda-forge
    cuda-gdb-12.8.55                    | h50b4baa_0        353 KB  conda-forge
    cuda-libraries-12.8.0               | ha770c72_0         20 KB  conda-forge
    cuda-libraries-dev-12.8.0           | ha770c72_0         20 KB  conda-forge
    cuda-nsight-12.8.55                 | h7938cbb_0      113.2 MB  conda-forge
    cuda-nvcc-12.8.61                   | hcdd1206_0         23 KB  conda-forge
    cuda-nvcc-dev_linux-64-12.8.61      | he91c749_1       12.7 MB  conda-forge
    cuda-nvcc-impl-12.8.61              | h85509e4_1         25 KB  conda-forge
    cuda-nvcc-tools-12.8.61             | he02047a_1       24.5
MB conda-forge 2025-05-07T20:26:07.5234436Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:07.5234879Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:26:07.5235315Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:26:07.5235738Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:26:07.5236170Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:26:07.5236602Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:26:07.5237025Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:26:07.5237458Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:07.5237897Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:26:07.5238447Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:26:07.5238883Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:26:07.5239307Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:26:07.5239724Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:07.5240551Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:26:07.5241199Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:07.5241732Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:07.5242268Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:26:07.5242799Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:07.5243283Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:07.5243777Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:26:07.5244298Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:07.5244820Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:07.5245276Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:07.5245708Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:26:07.5246242Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:07.5246836Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:07.5247423Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:07.5247988Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:07.5248499Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:07.5249022Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:07.5249559Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:07.5250056Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:07.5250502Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:07.5250945Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:26:07.5251399Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:07.5251814Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:07.5252249Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:07.5252693Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:07.5253133Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:07.5253602Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:26:07.5254105Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:26:07.5254608Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:26:07.5255106Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:07.5255604Z libcufile-1.13.0.11 | h12f29b5_0 939 KB 
conda-forge 2025-05-07T20:26:07.5256114Z libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge 2025-05-07T20:26:07.5256621Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:26:07.5257127Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:26:07.5257769Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:26:07.5258303Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:26:07.5258832Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:26:07.5259360Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:07.5259896Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:26:07.5260397Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:26:07.5260969Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:07.5261473Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:07.5261984Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:07.5262480Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:07.5262949Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:26:07.5263427Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:07.5263908Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:07.5264364Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:07.5264813Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:26:07.5265300Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:26:07.5265780Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:07.5266266Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:26:07.5266786Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:07.5267323Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:26:07.5267926Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:07.5268361Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:26:07.5268793Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:07.5269219Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:26:07.5269622Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:07.5270021Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:07.5270440Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:07.5270854Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:07.5271255Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:26:07.5271641Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:07.5272049Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:07.5272468Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:07.5272874Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:07.5273263Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:07.5273660Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:07.5274031Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:26:07.5274462Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:26:07.5274883Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:07.5275252Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:07.5275709Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:07.5276144Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:07.5276567Z pcre2-10.44 | 
hc749103_2 934 KB conda-forge 2025-05-07T20:26:07.5276976Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:07.5277407Z python-3.13.0 |h9ebbce0_101_cp313 31.5 MB conda-forge 2025-05-07T20:26:07.5277923Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:07.5278322Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:26:07.5278695Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:07.5279082Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:07.5279479Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:07.5279895Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:07.5280337Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:07.5280793Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:07.5281253Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:07.5281698Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:07.5282136Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:07.5282575Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:07.5282997Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:07.5283416Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:07.5283836Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:07.5284286Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:07.5284749Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:07.5285197Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:07.5285641Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:07.5286095Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:07.5286514Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:07.5286950Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:07.5287399Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:07.5287845Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:07.5288245Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:07.5288620Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:07.5288983Z ------------------------------------------------------------ 2025-05-07T20:26:07.5289311Z Total: 1.91 GB 2025-05-07T20:26:07.5289520Z 2025-05-07T20:26:07.5289644Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:07.5289863Z 2025-05-07T20:26:07.5290069Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:07.5290477Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:07.5290881Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:07.5291330Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:07.5291833Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:26:07.5292292Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:26:07.5292864Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:26:07.5293423Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:26:07.5293950Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:07.5294498Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:26:07.5295094Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 
2025-05-07T20:26:07.5295612Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:26:07.5296166Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:07.5296765Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:26:07.5299880Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:07.5300481Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:07.5301026Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5301538Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0 2025-05-07T20:26:07.5302018Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0 2025-05-07T20:26:07.5302543Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5303060Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1 2025-05-07T20:26:07.5303612Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1 2025-05-07T20:26:07.5304132Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0 2025-05-07T20:26:07.5304621Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0 2025-05-07T20:26:07.5305166Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0 2025-05-07T20:26:07.5305694Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0 2025-05-07T20:26:07.5306150Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0 2025-05-07T20:26:07.5306661Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1 2025-05-07T20:26:07.5307220Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1 2025-05-07T20:26:07.5307850Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1 2025-05-07T20:26:07.5308384Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0 2025-05-07T20:26:07.5308908Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5309411Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5309901Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0 2025-05-07T20:26:07.5310385Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5310859Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0 2025-05-07T20:26:07.5311343Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0 2025-05-07T20:26:07.5311829Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5312349Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:07.5312901Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1 2025-05-07T20:26:07.5313436Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1 2025-05-07T20:26:07.5313926Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0 2025-05-07T20:26:07.5314500Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5315004Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0 2025-05-07T20:26:07.5315550Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0 2025-05-07T20:26:07.5316065Z cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0 2025-05-07T20:26:07.5316609Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5317142Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0 2025-05-07T20:26:07.5317703Z cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0 2025-05-07T20:26:07.5318170Z cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3 2025-05-07T20:26:07.5318692Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0 2025-05-07T20:26:07.5319220Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:26:07.5319655Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:26:07.5320144Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:26:07.5320728Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:26:07.5321312Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:26:07.5321869Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:26:07.5322358Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:26:07.5322842Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:26:07.5323315Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:26:07.5323759Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:26:07.5324161Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:26:07.5324575Z gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0 2025-05-07T20:26:07.5324989Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:26:07.5325347Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:26:07.5325743Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:26:07.5326155Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:26:07.5326541Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:26:07.5326978Z libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:07.5327472Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:07.5327954Z libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0 2025-05-07T20:26:07.5328421Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0 2025-05-07T20:26:07.5328920Z libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0 2025-05-07T20:26:07.5329410Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0 2025-05-07T20:26:07.5329895Z libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0 2025-05-07T20:26:07.5330372Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0 2025-05-07T20:26:07.5330876Z libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:07.5331397Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:07.5331920Z libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0 2025-05-07T20:26:07.5332442Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0 2025-05-07T20:26:07.5332975Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:26:07.5333561Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:26:07.5334174Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:26:07.5334660Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:26:07.5335173Z 
libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:26:07.5335638Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:26:07.5336065Z libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2 2025-05-07T20:26:07.5336528Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:26:07.5337077Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:26:07.5337493Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:26:07.5337899Z libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0 2025-05-07T20:26:07.5338353Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0 2025-05-07T20:26:07.5338807Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:26:07.5339270Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5339775Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0 2025-05-07T20:26:07.5340776Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0 2025-05-07T20:26:07.5341485Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0 2025-05-07T20:26:07.5342009Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0 2025-05-07T20:26:07.5342495Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0 2025-05-07T20:26:07.5342983Z libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2 2025-05-07T20:26:07.5343409Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:26:07.5343837Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:26:07.5354242Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:26:07.5354825Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:26:07.5355311Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:26:07.5355835Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:26:07.5356390Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:26:07.5356829Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:26:07.5357242Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:07.5357637Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:26:07.5358103Z nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0 2025-05-07T20:26:07.5358567Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:26:07.5358932Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:26:07.5359309Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:26:07.5359777Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:26:07.5360242Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:26:07.5360694Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:26:07.5361157Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:26:07.5362166Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:26:07.5362589Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:26:07.5363115Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:26:07.5363630Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:26:07.5364392Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:26:07.5364958Z xcb-util-renderut~ 
conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
  xcb-util-wm         conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
  xkeyboard-config    conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
  xorg-libice         conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
  xorg-libsm          conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
  xorg-libx11         conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
  xorg-libxau         conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
  xorg-libxcomposite  conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
  xorg-libxdamage     conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
  xorg-libxdmcp       conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
  xorg-libxext        conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
  xorg-libxfixes      conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
  xorg-libxi          conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
  xorg-libxrandr      conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
  xorg-libxrender     conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
  xorg-libxtst        conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
  zstd                conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2

2025-05-07T20:26:07.5373842Z The following packages will be UPDATED:
  libuuid   pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
  ncurses   pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3
  sqlite    pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0
  zlib      pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2

2025-05-07T20:26:07.5376599Z The following packages will be SUPERSEDED by a higher-priority channel:
  expat   pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
  python  pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313
  tk      pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101

2025-05-07T20:26:07.5378841Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:07.5379211Z libcublas-12.8.3.14  | 460.2 MB | | 0%
2025-05-07T20:26:07.5379843Z nsight-compute-2025. | 320.6 MB | | 0%
2025-05-07T20:26:07.5380318Z libcusparse-12.5.7.5 | 164.9 MB | | 0%
2025-05-07T20:26:07.5380814Z libcusolver-11.7.2.5 | 156.9 MB | | 0%
2025-05-07T20:26:07.5381306Z libcufft-11.3.3.41   | 147.4 MB | | 0%
2025-05-07T20:26:07.5386267Z libnpp-12.3.3.65     | 130.6 MB | | 0%
2025-05-07T20:26:07.5391851Z cuda-nsight-12.8.55  | 113.2 MB | | 0%
2025-05-07T20:26:07.5393526Z cuda-nvvp-12.8.57    | 112.4 MB | | 0%
2025-05-07T20:26:07.5394664Z cuda-nvrtc-12.8.61   | 63.1 MB  | | 0%
2025-05-07T20:26:07.5396799Z libcurand-10.3.9.55  | 43.6 MB  | | 0%
2025-05-07T20:26:07.5400223Z gds-tools-1.13.0.11  | 37.9 MB  | | 0%
2025-05-07T20:26:07.5402945Z python-3.13.0        | 31.5 MB  | | 0%
2025-05-07T20:26:07.5405932Z libnvjitlink-12.8.61 | 28.7 MB  | | 0%
2025-05-07T20:26:07.5407383Z cuda-nvcc-tools-12.8 | 24.5 MB  | | 0%
2025-05-07T20:26:07.5409619Z cuda-nvvm-tools-12.8 | 23.5 MB  | | 0%
2025-05-07T20:26:07.5410816Z cuda-nvvm-impl-12.8. | 20.8 MB  | | 0%
2025-05-07T20:26:07.5412237Z cuda-nvcc-dev_linux- | 12.7 MB  | | 0%
2025-05-07T20:26:07.5414018Z cuda-sanitizer-api-1 | 8.8 MB   | | 0%
2025-05-07T20:26:07.5415525Z cuda-nvdisasm-12.8.5 | 4.9 MB   | | 0%
2025-05-07T20:26:07.6312440Z ... (more hidden) ...
| 320.6 MB | ###2 | 33%  2025-05-07T20:26:10.7082779Z 2025-05-07T20:26:10.7082784Z 2025-05-07T20:26:10.7317096Z libcusparse-12.5.7.5 | 164.9 MB | ######9 | 69%  2025-05-07T20:26:10.7460794Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:26:10.7461119Z 2025-05-07T20:26:10.7461123Z 2025-05-07T20:26:10.7462817Z 2025-05-07T20:26:10.7499940Z libcusolver-11.7.2.5 | 156.9 MB | ######2 | 63%  2025-05-07T20:26:10.7500220Z 2025-05-07T20:26:10.8082641Z nsight-compute-2025. | 320.6 MB | ###3 | 34%  2025-05-07T20:26:10.8082925Z 2025-05-07T20:26:10.8082929Z 2025-05-07T20:26:10.8189540Z libcusparse-12.5.7.5 | 164.9 MB | #######1 | 72%  2025-05-07T20:26:10.8190408Z 2025-05-07T20:26:10.8190416Z 2025-05-07T20:26:10.8190421Z 2025-05-07T20:26:10.8190426Z 2025-05-07T20:26:10.8322031Z libcufft-11.3.3.41 | 147.4 MB | ######5 | 65%  2025-05-07T20:26:10.8464825Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 25% 2025-05-07T20:26:10.8465189Z 2025-05-07T20:26:10.8465196Z 2025-05-07T20:26:10.8467208Z 2025-05-07T20:26:10.8500710Z libcusolver-11.7.2.5 | 156.9 MB | ######5 | 65%  2025-05-07T20:26:10.8501745Z 2025-05-07T20:26:10.9113408Z nsight-compute-2025. | 320.6 MB | ###5 | 35%  2025-05-07T20:26:10.9113709Z 2025-05-07T20:26:10.9114122Z 2025-05-07T20:26:10.9258633Z libcusparse-12.5.7.5 | 164.9 MB | #######4 | 74%  2025-05-07T20:26:10.9258985Z 2025-05-07T20:26:10.9258990Z 2025-05-07T20:26:10.9258994Z 2025-05-07T20:26:10.9258997Z 2025-05-07T20:26:10.9324118Z libcufft-11.3.3.41 | 147.4 MB | ######6 | 67%  2025-05-07T20:26:10.9496033Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 26% 2025-05-07T20:26:10.9496430Z 2025-05-07T20:26:10.9496437Z 2025-05-07T20:26:10.9496442Z 2025-05-07T20:26:10.9508411Z libcusolver-11.7.2.5 | 156.9 MB | ######7 | 68%  2025-05-07T20:26:10.9509327Z 2025-05-07T20:26:11.0260335Z nsight-compute-2025. | 320.6 MB | ###6 | 36%  2025-05-07T20:26:11.0260608Z 2025-05-07T20:26:11.0260612Z 2025-05-07T20:26:11.0260616Z 2025-05-07T20:26:11.0260620Z 2025-05-07T20:26:11.0326764Z libcufft-11.3.3.41 | 147.4 MB | ######9 | 69%  2025-05-07T20:26:11.0632093Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:26:11.0632538Z 2025-05-07T20:26:11.0632572Z 2025-05-07T20:26:11.0640638Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 77%  2025-05-07T20:26:11.0640908Z 2025-05-07T20:26:11.0640913Z 2025-05-07T20:26:11.0643585Z 2025-05-07T20:26:11.0872470Z libcusolver-11.7.2.5 | 156.9 MB | ####### | 70%  2025-05-07T20:26:11.0872745Z 2025-05-07T20:26:11.1330661Z nsight-compute-2025. | 320.6 MB | ###7 | 38%  2025-05-07T20:26:11.1340859Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 27% 2025-05-07T20:26:11.1341116Z 2025-05-07T20:26:11.1341122Z 2025-05-07T20:26:11.1341150Z 2025-05-07T20:26:11.1342871Z 2025-05-07T20:26:11.1633715Z libcufft-11.3.3.41 | 147.4 MB | #######1 | 71%  2025-05-07T20:26:11.1634004Z 2025-05-07T20:26:11.1634008Z 2025-05-07T20:26:11.1640799Z libcusparse-12.5.7.5 | 164.9 MB | #######9 | 79%  2025-05-07T20:26:11.1641143Z 2025-05-07T20:26:11.1641149Z 2025-05-07T20:26:11.1641917Z 2025-05-07T20:26:11.2371660Z libcusolver-11.7.2.5 | 156.9 MB | #######2 | 73%  2025-05-07T20:26:11.2372077Z 2025-05-07T20:26:11.2372085Z 2025-05-07T20:26:11.2372090Z 2025-05-07T20:26:11.2372096Z 2025-05-07T20:26:11.2374418Z libcufft-11.3.3.41 | 147.4 MB | #######3 | 74%  2025-05-07T20:26:11.2655090Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 28% 2025-05-07T20:26:11.2655353Z 2025-05-07T20:26:11.2699746Z nsight-compute-2025. 
| 320.6 MB | ###8 | 39%  2025-05-07T20:26:11.2700082Z 2025-05-07T20:26:11.2700086Z 2025-05-07T20:26:11.2820400Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:26:11.2820783Z 2025-05-07T20:26:11.2820787Z 2025-05-07T20:26:11.2821408Z 2025-05-07T20:26:11.3374335Z libcusolver-11.7.2.5 | 156.9 MB | #######5 | 75%  2025-05-07T20:26:11.3374779Z 2025-05-07T20:26:11.3374786Z 2025-05-07T20:26:11.3374792Z 2025-05-07T20:26:11.3375307Z 2025-05-07T20:26:11.3469392Z libcufft-11.3.3.41 | 147.4 MB | #######5 | 76%  2025-05-07T20:26:11.3661197Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 29% 2025-05-07T20:26:11.3661497Z 2025-05-07T20:26:11.3702326Z nsight-compute-2025. | 320.6 MB | ###9 | 40%  2025-05-07T20:26:11.3702604Z 2025-05-07T20:26:11.3702814Z 2025-05-07T20:26:11.4376226Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 84%  2025-05-07T20:26:11.4376593Z 2025-05-07T20:26:11.4376598Z 2025-05-07T20:26:11.4376625Z 2025-05-07T20:26:11.4376629Z 2025-05-07T20:26:11.4471900Z libcufft-11.3.3.41 | 147.4 MB | #######8 | 78%  2025-05-07T20:26:11.4496117Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 30% 2025-05-07T20:26:11.4496461Z 2025-05-07T20:26:11.4496468Z 2025-05-07T20:26:11.4498449Z 2025-05-07T20:26:11.4663360Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 77%  2025-05-07T20:26:11.4667675Z 2025-05-07T20:26:11.4704437Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:26:11.4704704Z 2025-05-07T20:26:11.4706580Z 2025-05-07T20:26:11.5377233Z libcusparse-12.5.7.5 | 164.9 MB | ########6 | 87%  2025-05-07T20:26:11.5377660Z 2025-05-07T20:26:11.5377678Z 2025-05-07T20:26:11.5377684Z 2025-05-07T20:26:11.5377961Z 2025-05-07T20:26:11.5498305Z libcufft-11.3.3.41 | 147.4 MB | ######## | 81%  2025-05-07T20:26:11.5498629Z 2025-05-07T20:26:11.5498644Z 2025-05-07T20:26:11.5499239Z 2025-05-07T20:26:11.5557096Z libcusolver-11.7.2.5 | 156.9 MB | #######9 | 79%  2025-05-07T20:26:11.5663674Z libcublas-12.8.3.14 | 460.2 MB | ### | 31% 2025-05-07T20:26:11.5665272Z 2025-05-07T20:26:11.5758945Z nsight-compute-2025. | 320.6 MB | ####2 | 42%  2025-05-07T20:26:11.5759203Z 2025-05-07T20:26:11.5759207Z 2025-05-07T20:26:11.6379891Z libcusparse-12.5.7.5 | 164.9 MB | ########8 | 89%  2025-05-07T20:26:11.6380204Z 2025-05-07T20:26:11.6380209Z 2025-05-07T20:26:11.6380214Z 2025-05-07T20:26:11.6380576Z 2025-05-07T20:26:11.6501067Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 83%  2025-05-07T20:26:11.6501483Z 2025-05-07T20:26:11.6501517Z 2025-05-07T20:26:11.6502453Z 2025-05-07T20:26:11.6559601Z libcusolver-11.7.2.5 | 156.9 MB | ########1 | 82%  2025-05-07T20:26:11.6686678Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:26:11.6687991Z 2025-05-07T20:26:11.6833065Z nsight-compute-2025. | 320.6 MB | ####3 | 44%  2025-05-07T20:26:11.6833464Z 2025-05-07T20:26:11.6834388Z 2025-05-07T20:26:11.7380366Z libcusparse-12.5.7.5 | 164.9 MB | #########1 | 91%  2025-05-07T20:26:11.7380653Z 2025-05-07T20:26:11.7380661Z 2025-05-07T20:26:11.7380666Z 2025-05-07T20:26:11.7380671Z 2025-05-07T20:26:11.7503325Z libcufft-11.3.3.41 | 147.4 MB | ########5 | 86%  2025-05-07T20:26:11.7503611Z 2025-05-07T20:26:11.7503615Z 2025-05-07T20:26:11.7504284Z 2025-05-07T20:26:11.7563478Z libcusolver-11.7.2.5 | 156.9 MB | ########3 | 84%  2025-05-07T20:26:11.7717161Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 32% 2025-05-07T20:26:11.7719674Z 2025-05-07T20:26:11.7877908Z nsight-compute-2025. 
| 320.6 MB | ####4 | 45%  2025-05-07T20:26:11.7878210Z 2025-05-07T20:26:11.7879840Z 2025-05-07T20:26:11.8381078Z libcusparse-12.5.7.5 | 164.9 MB | #########3 | 94%  2025-05-07T20:26:11.8381369Z 2025-05-07T20:26:11.8381384Z 2025-05-07T20:26:11.8381388Z 2025-05-07T20:26:11.8382665Z 2025-05-07T20:26:11.8509501Z libcufft-11.3.3.41 | 147.4 MB | ########8 | 88%  2025-05-07T20:26:11.8510045Z 2025-05-07T20:26:11.8510056Z 2025-05-07T20:26:11.8510767Z 2025-05-07T20:26:11.8718147Z libcusolver-11.7.2.5 | 156.9 MB | ########6 | 86%  2025-05-07T20:26:11.8722236Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:26:11.8723924Z 2025-05-07T20:26:11.8922715Z nsight-compute-2025. | 320.6 MB | ####5 | 46%  2025-05-07T20:26:11.8923122Z 2025-05-07T20:26:11.8923931Z 2025-05-07T20:26:11.9390351Z libcusparse-12.5.7.5 | 164.9 MB | #########5 | 96%  2025-05-07T20:26:11.9390748Z 2025-05-07T20:26:11.9390754Z 2025-05-07T20:26:11.9391022Z 2025-05-07T20:26:11.9392931Z 2025-05-07T20:26:11.9510889Z libcufft-11.3.3.41 | 147.4 MB | ######### | 91%  2025-05-07T20:26:11.9511255Z 2025-05-07T20:26:11.9511259Z 2025-05-07T20:26:11.9511953Z 2025-05-07T20:26:11.9747575Z libcusolver-11.7.2.5 | 156.9 MB | ########8 | 88%  2025-05-07T20:26:11.9802558Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 34% 2025-05-07T20:26:11.9803081Z 2025-05-07T20:26:11.9924560Z nsight-compute-2025. | 320.6 MB | ####6 | 47%  2025-05-07T20:26:11.9924841Z 2025-05-07T20:26:11.9925516Z 2025-05-07T20:26:12.0394447Z libcusparse-12.5.7.5 | 164.9 MB | #########8 | 98%  2025-05-07T20:26:12.0394729Z 2025-05-07T20:26:12.0394736Z 2025-05-07T20:26:12.0394742Z 2025-05-07T20:26:12.0395311Z 2025-05-07T20:26:12.0512103Z libcufft-11.3.3.41 | 147.4 MB | #########3 | 93%  2025-05-07T20:26:12.0512372Z 2025-05-07T20:26:12.0512378Z 2025-05-07T20:26:12.0512803Z 2025-05-07T20:26:12.0754052Z libcusolver-11.7.2.5 | 156.9 MB | ######### | 91%  2025-05-07T20:26:12.0803647Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:26:12.0805250Z 2025-05-07T20:26:12.1512752Z nsight-compute-2025. | 320.6 MB | ####8 | 48%  2025-05-07T20:26:12.1513084Z 2025-05-07T20:26:12.1513089Z 2025-05-07T20:26:12.1514581Z 2025-05-07T20:26:12.1754377Z libcusolver-11.7.2.5 | 156.9 MB | #########3 | 94%  2025-05-07T20:26:12.1847370Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 36% 2025-05-07T20:26:12.1849375Z 2025-05-07T20:26:12.2069853Z nsight-compute-2025. | 320.6 MB | ####9 | 49%  2025-05-07T20:26:12.2070226Z 2025-05-07T20:26:12.2070232Z 2025-05-07T20:26:12.2070238Z 2025-05-07T20:26:12.2070244Z 2025-05-07T20:26:12.2513069Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 96%  2025-05-07T20:26:12.2513348Z 2025-05-07T20:26:12.2513354Z 2025-05-07T20:26:12.2514018Z 2025-05-07T20:26:12.2848369Z libcusolver-11.7.2.5 | 156.9 MB | #########6 | 97%  2025-05-07T20:26:12.2848688Z 2025-05-07T20:26:12.3128913Z nsight-compute-2025. | 320.6 MB | ##### | 51%  2025-05-07T20:26:12.3129183Z 2025-05-07T20:26:12.3129190Z 2025-05-07T20:26:12.3129195Z 2025-05-07T20:26:12.3129199Z 2025-05-07T20:26:12.3513183Z libcufft-11.3.3.41 | 147.4 MB | #########8 | 98%  2025-05-07T20:26:12.3513539Z 2025-05-07T20:26:12.3513576Z 2025-05-07T20:26:12.3513585Z 2025-05-07T20:26:12.3662204Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 99%  2025-05-07T20:26:12.3912671Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 37% 2025-05-07T20:26:12.3914539Z 2025-05-07T20:26:12.4818546Z nsight-compute-2025. 
| 320.6 MB | #####2 | 52%  2025-05-07T20:26:12.4913559Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:26:12.4915381Z 2025-05-07T20:26:12.5820350Z nsight-compute-2025. | 320.6 MB | #####3 | 54%  2025-05-07T20:26:12.5913672Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:26:12.5914567Z 2025-05-07T20:26:12.6914403Z nsight-compute-2025. | 320.6 MB | #####5 | 55%  2025-05-07T20:26:12.6915247Z 2025-05-07T20:26:12.7555649Z nsight-compute-2025. | 320.6 MB | #####7 | 58%  2025-05-07T20:26:12.7914401Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 39% 2025-05-07T20:26:12.7916258Z 2025-05-07T20:26:12.8558650Z nsight-compute-2025. | 320.6 MB | #####9 | 60%  2025-05-07T20:26:12.9013916Z libcublas-12.8.3.14 | 460.2 MB | #### | 41% 2025-05-07T20:26:12.9014601Z 2025-05-07T20:26:12.9614990Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:26:13.0017151Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 42% 2025-05-07T20:26:13.0017650Z 2025-05-07T20:26:13.0615277Z nsight-compute-2025. | 320.6 MB | ######3 | 63%  2025-05-07T20:26:13.1161517Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 43% 2025-05-07T20:26:13.1163327Z 2025-05-07T20:26:13.1617552Z nsight-compute-2025. | 320.6 MB | ######5 | 65%  2025-05-07T20:26:13.2215488Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 44% 2025-05-07T20:26:13.2216824Z 2025-05-07T20:26:13.2620308Z nsight-compute-2025. | 320.6 MB | ######6 | 67%  2025-05-07T20:26:13.3278701Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 45% 2025-05-07T20:26:13.3278995Z 2025-05-07T20:26:13.3621476Z nsight-compute-2025. | 320.6 MB | ######8 | 68%  2025-05-07T20:26:13.4279743Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:26:13.4281109Z 2025-05-07T20:26:13.4981802Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:26:13.5282293Z libcublas-12.8.3.14 | 460.2 MB | ####7 | 47% 2025-05-07T20:26:13.5284457Z 2025-05-07T20:26:13.5984422Z nsight-compute-2025. | 320.6 MB | #######1 | 72%  2025-05-07T20:26:13.6391944Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:26:13.6394956Z 2025-05-07T20:26:13.7393033Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:26:13.7393807Z 2025-05-07T20:26:13.7606411Z nsight-compute-2025. | 320.6 MB | #######6 | 76%  2025-05-07T20:26:13.8608895Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 49% 2025-05-07T20:26:13.8617799Z libcublas-12.8.3.14 | 460.2 MB | ##### | 50% 2025-05-07T20:26:13.8619492Z 2025-05-07T20:26:13.9609375Z nsight-compute-2025. | 320.6 MB | #######7 | 78%  2025-05-07T20:26:13.9785371Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 51% 2025-05-07T20:26:13.9785663Z 2025-05-07T20:26:13.9964221Z nsight-compute-2025. | 320.6 MB | #######9 | 80%  2025-05-07T20:26:13.9964538Z 2025-05-07T20:26:13.9969224Z 2025-05-07T20:26:14.0421756Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:14.0422047Z 2025-05-07T20:26:14.0422052Z 2025-05-07T20:26:14.0422056Z 2025-05-07T20:26:14.0422060Z 2025-05-07T20:26:14.0425525Z 2025-05-07T20:26:14.0612158Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:26:14.1193351Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 52% 2025-05-07T20:26:14.1196354Z 2025-05-07T20:26:14.1427094Z nsight-compute-2025. 
| 320.6 MB | ########1 | 81%  2025-05-07T20:26:14.1427363Z 2025-05-07T20:26:14.1427368Z 2025-05-07T20:26:14.1427372Z 2025-05-07T20:26:14.1427383Z 2025-05-07T20:26:14.1427387Z 2025-05-07T20:26:14.1877222Z libnpp-12.3.3.65 | 130.6 MB | 2 | 3%  2025-05-07T20:26:14.2428203Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 53% 2025-05-07T20:26:14.2428463Z 2025-05-07T20:26:14.2428467Z 2025-05-07T20:26:14.2428471Z 2025-05-07T20:26:14.2428475Z 2025-05-07T20:26:14.2431754Z 2025-05-07T20:26:14.2655753Z libnpp-12.3.3.65 | 130.6 MB | 5 | 6%  2025-05-07T20:26:14.2656095Z 2025-05-07T20:26:14.3010377Z nsight-compute-2025. | 320.6 MB | ########2 | 83%  2025-05-07T20:26:14.3429314Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 54% 2025-05-07T20:26:14.3429664Z 2025-05-07T20:26:14.3429672Z 2025-05-07T20:26:14.3429678Z 2025-05-07T20:26:14.3429683Z 2025-05-07T20:26:14.3429718Z 2025-05-07T20:26:14.4000194Z libnpp-12.3.3.65 | 130.6 MB | 8 | 9%  2025-05-07T20:26:14.4001075Z 2025-05-07T20:26:14.4237407Z nsight-compute-2025. | 320.6 MB | ########4 | 84%  2025-05-07T20:26:14.4430177Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 55% 2025-05-07T20:26:14.4430498Z 2025-05-07T20:26:14.4430787Z 2025-05-07T20:26:14.4430792Z 2025-05-07T20:26:14.4430797Z 2025-05-07T20:26:14.4432697Z 2025-05-07T20:26:14.4879446Z libnpp-12.3.3.65 | 130.6 MB | #1 | 12%  2025-05-07T20:26:14.4879735Z 2025-05-07T20:26:14.4879740Z 2025-05-07T20:26:14.4879744Z 2025-05-07T20:26:14.4883466Z 2025-05-07T20:26:14.5282151Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:14.5282437Z 2025-05-07T20:26:14.5416028Z nsight-compute-2025. | 320.6 MB | ########5 | 86%  2025-05-07T20:26:14.5434289Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 56% 2025-05-07T20:26:14.5434546Z 2025-05-07T20:26:14.5434790Z 2025-05-07T20:26:14.5434796Z 2025-05-07T20:26:14.5434800Z 2025-05-07T20:26:14.5438448Z 2025-05-07T20:26:14.5579040Z libnpp-12.3.3.65 | 130.6 MB | #4 | 14%  2025-05-07T20:26:14.5579335Z 2025-05-07T20:26:14.5579340Z 2025-05-07T20:26:14.5579343Z 2025-05-07T20:26:14.5579347Z 2025-05-07T20:26:14.5579358Z 2025-05-07T20:26:14.5579375Z 2025-05-07T20:26:14.6581289Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:26:14.6581620Z 2025-05-07T20:26:14.6581624Z 2025-05-07T20:26:14.6581629Z 2025-05-07T20:26:14.6581643Z 2025-05-07T20:26:14.6581647Z 2025-05-07T20:26:14.6581651Z 2025-05-07T20:26:14.6594954Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 3%  2025-05-07T20:26:14.6668803Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:26:14.6669046Z 2025-05-07T20:26:14.6855423Z nsight-compute-2025. | 320.6 MB | ########6 | 87%  2025-05-07T20:26:14.6855686Z 2025-05-07T20:26:14.6855898Z 2025-05-07T20:26:14.6855929Z 2025-05-07T20:26:14.6855936Z 2025-05-07T20:26:14.6855963Z 2025-05-07T20:26:14.7584526Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:26:14.7584852Z 2025-05-07T20:26:14.7584856Z 2025-05-07T20:26:14.7584861Z 2025-05-07T20:26:14.7584866Z 2025-05-07T20:26:14.7584871Z 2025-05-07T20:26:14.7584885Z 2025-05-07T20:26:14.7883940Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:26:14.8043223Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 57% 2025-05-07T20:26:14.8045285Z 2025-05-07T20:26:14.8168849Z nsight-compute-2025. 
| 320.6 MB | ########7 | 88%  2025-05-07T20:26:14.8169123Z 2025-05-07T20:26:14.8169129Z 2025-05-07T20:26:14.8169133Z 2025-05-07T20:26:14.8169138Z 2025-05-07T20:26:14.8169146Z 2025-05-07T20:26:14.8584257Z libnpp-12.3.3.65 | 130.6 MB | #9 | 19%  2025-05-07T20:26:14.8584539Z 2025-05-07T20:26:14.8584544Z 2025-05-07T20:26:14.8584548Z 2025-05-07T20:26:14.8584569Z 2025-05-07T20:26:14.8584573Z 2025-05-07T20:26:14.8584863Z 2025-05-07T20:26:14.8975485Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 8%  2025-05-07T20:26:14.9297078Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 58% 2025-05-07T20:26:14.9297354Z 2025-05-07T20:26:14.9297360Z 2025-05-07T20:26:14.9297365Z 2025-05-07T20:26:14.9297400Z 2025-05-07T20:26:14.9299135Z 2025-05-07T20:26:14.9387146Z libnpp-12.3.3.65 | 130.6 MB | ##1 | 21%  2025-05-07T20:26:14.9392044Z 2025-05-07T20:26:14.9588595Z nsight-compute-2025. | 320.6 MB | ########9 | 89%  2025-05-07T20:26:14.9588867Z 2025-05-07T20:26:14.9588873Z 2025-05-07T20:26:14.9588887Z 2025-05-07T20:26:14.9588894Z 2025-05-07T20:26:14.9588898Z 2025-05-07T20:26:14.9588902Z 2025-05-07T20:26:15.0154432Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:26:15.0415485Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:26:15.0415749Z 2025-05-07T20:26:15.0416109Z 2025-05-07T20:26:15.0416122Z 2025-05-07T20:26:15.0416129Z 2025-05-07T20:26:15.0416135Z 2025-05-07T20:26:15.0592891Z libnpp-12.3.3.65 | 130.6 MB | ##3 | 24%  2025-05-07T20:26:15.0593209Z 2025-05-07T20:26:15.0593215Z 2025-05-07T20:26:15.0593219Z 2025-05-07T20:26:15.0593223Z 2025-05-07T20:26:15.0593229Z 2025-05-07T20:26:15.0593970Z 2025-05-07T20:26:15.0638792Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:26:15.0640879Z 2025-05-07T20:26:15.1172272Z nsight-compute-2025. | 320.6 MB | ########9 | 90%  2025-05-07T20:26:15.1489499Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:26:15.1489856Z 2025-05-07T20:26:15.1490036Z 2025-05-07T20:26:15.1490042Z 2025-05-07T20:26:15.1490151Z 2025-05-07T20:26:15.1491675Z 2025-05-07T20:26:15.1600465Z libnpp-12.3.3.65 | 130.6 MB | ##5 | 26%  2025-05-07T20:26:15.1600840Z 2025-05-07T20:26:15.1600844Z 2025-05-07T20:26:15.1601098Z 2025-05-07T20:26:15.1601120Z 2025-05-07T20:26:15.1601124Z 2025-05-07T20:26:15.1603397Z 2025-05-07T20:26:15.1656268Z cuda-nsight-12.8.55 | 113.2 MB | #5 | 16%  2025-05-07T20:26:15.1657899Z 2025-05-07T20:26:15.2229206Z nsight-compute-2025. | 320.6 MB | ######### | 91%  2025-05-07T20:26:15.2494825Z libcublas-12.8.3.14 | 460.2 MB | ###### | 60% 2025-05-07T20:26:15.2495610Z 2025-05-07T20:26:15.2495617Z 2025-05-07T20:26:15.2495623Z 2025-05-07T20:26:15.2495628Z 2025-05-07T20:26:15.2497716Z 2025-05-07T20:26:15.2600257Z libnpp-12.3.3.65 | 130.6 MB | ##8 | 28%  2025-05-07T20:26:15.2600640Z 2025-05-07T20:26:15.2600646Z 2025-05-07T20:26:15.2600651Z 2025-05-07T20:26:15.2600656Z 2025-05-07T20:26:15.2600661Z 2025-05-07T20:26:15.2604536Z 2025-05-07T20:26:15.2735070Z cuda-nsight-12.8.55 | 113.2 MB | #8 | 18%  2025-05-07T20:26:15.2737938Z 2025-05-07T20:26:15.3321843Z nsight-compute-2025. 
| 320.6 MB | #########1 | 92%  2025-05-07T20:26:15.3516307Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:26:15.3516639Z 2025-05-07T20:26:15.3516645Z 2025-05-07T20:26:15.3516651Z 2025-05-07T20:26:15.3516656Z 2025-05-07T20:26:15.3522838Z 2025-05-07T20:26:15.3603154Z libnpp-12.3.3.65 | 130.6 MB | ### | 30%  2025-05-07T20:26:15.3603543Z 2025-05-07T20:26:15.3603549Z 2025-05-07T20:26:15.3603555Z 2025-05-07T20:26:15.3605296Z 2025-05-07T20:26:15.3605302Z 2025-05-07T20:26:15.3605307Z 2025-05-07T20:26:15.3737992Z cuda-nsight-12.8.55 | 113.2 MB | ##1 | 21%  2025-05-07T20:26:15.3739499Z 2025-05-07T20:26:15.4326830Z nsight-compute-2025. | 320.6 MB | #########2 | 93%  2025-05-07T20:26:15.4650816Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 62% 2025-05-07T20:26:15.4651197Z 2025-05-07T20:26:15.4651204Z 2025-05-07T20:26:15.4651209Z 2025-05-07T20:26:15.4651215Z 2025-05-07T20:26:15.4651220Z 2025-05-07T20:26:15.4708781Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 32%  2025-05-07T20:26:15.4709167Z 2025-05-07T20:26:15.4709174Z 2025-05-07T20:26:15.4709179Z 2025-05-07T20:26:15.4709185Z 2025-05-07T20:26:15.4709192Z 2025-05-07T20:26:15.4711048Z 2025-05-07T20:26:15.4742124Z cuda-nsight-12.8.55 | 113.2 MB | ##3 | 24%  2025-05-07T20:26:15.4742525Z 2025-05-07T20:26:15.5327998Z nsight-compute-2025. | 320.6 MB | #########3 | 94%  2025-05-07T20:26:15.5655373Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 62% 2025-05-07T20:26:15.5655756Z 2025-05-07T20:26:15.5655762Z 2025-05-07T20:26:15.5655768Z 2025-05-07T20:26:15.5655773Z 2025-05-07T20:26:15.5655779Z 2025-05-07T20:26:15.5751411Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:26:15.5751813Z 2025-05-07T20:26:15.5751819Z 2025-05-07T20:26:15.5751825Z 2025-05-07T20:26:15.5751830Z 2025-05-07T20:26:15.5751835Z 2025-05-07T20:26:15.5754163Z 2025-05-07T20:26:15.5851250Z cuda-nsight-12.8.55 | 113.2 MB | ##6 | 26%  2025-05-07T20:26:15.5851657Z 2025-05-07T20:26:15.6330368Z nsight-compute-2025. | 320.6 MB | #########4 | 95%  2025-05-07T20:26:15.6711487Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:26:15.6711842Z 2025-05-07T20:26:15.6711848Z 2025-05-07T20:26:15.6711854Z 2025-05-07T20:26:15.6711859Z 2025-05-07T20:26:15.6714923Z 2025-05-07T20:26:15.6751678Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:26:15.6752060Z 2025-05-07T20:26:15.6752066Z 2025-05-07T20:26:15.6752071Z 2025-05-07T20:26:15.6752077Z 2025-05-07T20:26:15.6752082Z 2025-05-07T20:26:15.6752087Z 2025-05-07T20:26:15.6949284Z cuda-nsight-12.8.55 | 113.2 MB | ##8 | 29%  2025-05-07T20:26:15.6952737Z 2025-05-07T20:26:15.7357201Z nsight-compute-2025. | 320.6 MB | #########5 | 95%  2025-05-07T20:26:15.7714750Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:26:15.7715125Z 2025-05-07T20:26:15.7715131Z 2025-05-07T20:26:15.7715787Z 2025-05-07T20:26:15.7715795Z 2025-05-07T20:26:15.7717001Z 2025-05-07T20:26:15.7751567Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:26:15.7751976Z 2025-05-07T20:26:15.7751982Z 2025-05-07T20:26:15.7751987Z 2025-05-07T20:26:15.7751992Z 2025-05-07T20:26:15.7751997Z 2025-05-07T20:26:15.7752175Z 2025-05-07T20:26:15.8037784Z cuda-nsight-12.8.55 | 113.2 MB | ###1 | 32%  2025-05-07T20:26:15.8043632Z 2025-05-07T20:26:15.8466662Z nsight-compute-2025. 
| 320.6 MB | #########6 | 96%  2025-05-07T20:26:15.8723307Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:26:15.8723635Z 2025-05-07T20:26:15.8723641Z 2025-05-07T20:26:15.8723647Z 2025-05-07T20:26:15.8723652Z 2025-05-07T20:26:15.8727132Z 2025-05-07T20:26:15.8752671Z libnpp-12.3.3.65 | 130.6 MB | #### | 41%  2025-05-07T20:26:15.8753062Z 2025-05-07T20:26:15.8753068Z 2025-05-07T20:26:15.8753074Z 2025-05-07T20:26:15.8753103Z 2025-05-07T20:26:15.8753109Z 2025-05-07T20:26:15.8753114Z 2025-05-07T20:26:15.8847016Z cuda-nsight-12.8.55 | 113.2 MB | ###4 | 34%  2025-05-07T20:26:15.8847409Z 2025-05-07T20:26:15.8847415Z 2025-05-07T20:26:15.8847449Z 2025-05-07T20:26:15.9044338Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:15.9045847Z 2025-05-07T20:26:15.9427853Z nsight-compute-2025. | 320.6 MB | #########7 | 97%  2025-05-07T20:26:15.9428243Z 2025-05-07T20:26:15.9428249Z 2025-05-07T20:26:15.9428254Z 2025-05-07T20:26:15.9428260Z 2025-05-07T20:26:15.9428266Z 2025-05-07T20:26:15.9428271Z 2025-05-07T20:26:15.9428278Z 2025-05-07T20:26:15.9574820Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:26:15.9867521Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 65% 2025-05-07T20:26:15.9867965Z 2025-05-07T20:26:15.9867971Z 2025-05-07T20:26:15.9867976Z 2025-05-07T20:26:15.9867981Z 2025-05-07T20:26:15.9875207Z 2025-05-07T20:26:16.0052065Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 43%  2025-05-07T20:26:16.0052453Z 2025-05-07T20:26:16.0052460Z 2025-05-07T20:26:16.0052465Z 2025-05-07T20:26:16.0052470Z 2025-05-07T20:26:16.0052475Z 2025-05-07T20:26:16.0052481Z 2025-05-07T20:26:16.0138692Z cuda-nsight-12.8.55 | 113.2 MB | ###7 | 37%  2025-05-07T20:26:16.0141532Z 2025-05-07T20:26:16.0431681Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:26:16.0432069Z 2025-05-07T20:26:16.0432075Z 2025-05-07T20:26:16.0432081Z 2025-05-07T20:26:16.0432086Z 2025-05-07T20:26:16.0432091Z 2025-05-07T20:26:16.0432096Z 2025-05-07T20:26:16.0432854Z 2025-05-07T20:26:16.0743460Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 2%  2025-05-07T20:26:16.1030174Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:26:16.1030540Z 2025-05-07T20:26:16.1030548Z 2025-05-07T20:26:16.1030554Z 2025-05-07T20:26:16.1030559Z 2025-05-07T20:26:16.1034407Z 2025-05-07T20:26:16.1105491Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 45%  2025-05-07T20:26:16.1105885Z 2025-05-07T20:26:16.1105891Z 2025-05-07T20:26:16.1105896Z 2025-05-07T20:26:16.1105901Z 2025-05-07T20:26:16.1105906Z 2025-05-07T20:26:16.1108939Z 2025-05-07T20:26:16.1299068Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 40%  2025-05-07T20:26:16.1308883Z 2025-05-07T20:26:16.1437494Z nsight-compute-2025. 
| 320.6 MB | #########8 | 99%  2025-05-07T20:26:16.1437851Z 2025-05-07T20:26:16.1437856Z 2025-05-07T20:26:16.1437859Z 2025-05-07T20:26:16.1437863Z 2025-05-07T20:26:16.1437867Z 2025-05-07T20:26:16.1437870Z 2025-05-07T20:26:16.1439159Z 2025-05-07T20:26:16.1759816Z cuda-nvvp-12.8.57 | 112.4 MB | 4 | 4%  2025-05-07T20:26:16.2159467Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:26:16.2159828Z 2025-05-07T20:26:16.2159832Z 2025-05-07T20:26:16.2159836Z 2025-05-07T20:26:16.2159840Z 2025-05-07T20:26:16.2163238Z 2025-05-07T20:26:16.2267424Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 47%  2025-05-07T20:26:16.2268271Z 2025-05-07T20:26:16.2268277Z 2025-05-07T20:26:16.2268282Z 2025-05-07T20:26:16.2268288Z 2025-05-07T20:26:16.2268293Z 2025-05-07T20:26:16.2268299Z 2025-05-07T20:26:16.2389772Z cuda-nsight-12.8.55 | 113.2 MB | ####2 | 42%  2025-05-07T20:26:16.2390204Z 2025-05-07T20:26:16.2442135Z nsight-compute-2025. | 320.6 MB | #########9 | 100%  2025-05-07T20:26:16.2442504Z 2025-05-07T20:26:16.2442510Z 2025-05-07T20:26:16.2442525Z 2025-05-07T20:26:16.2442530Z 2025-05-07T20:26:16.2442535Z 2025-05-07T20:26:16.2442541Z 2025-05-07T20:26:16.2442546Z 2025-05-07T20:26:16.2763052Z cuda-nvvp-12.8.57 | 112.4 MB | 6 | 6%  2025-05-07T20:26:16.3223359Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:26:16.3223709Z 2025-05-07T20:26:16.3223715Z 2025-05-07T20:26:16.3223721Z 2025-05-07T20:26:16.3223762Z 2025-05-07T20:26:16.3226326Z 2025-05-07T20:26:16.3392059Z libnpp-12.3.3.65 | 130.6 MB | ####8 | 49%  2025-05-07T20:26:16.3392449Z 2025-05-07T20:26:16.3392455Z 2025-05-07T20:26:16.3392657Z 2025-05-07T20:26:16.3392662Z 2025-05-07T20:26:16.3392667Z 2025-05-07T20:26:16.3392673Z 2025-05-07T20:26:16.3449416Z cuda-nsight-12.8.55 | 113.2 MB | ####4 | 44%  2025-05-07T20:26:16.3449837Z 2025-05-07T20:26:16.3449842Z 2025-05-07T20:26:16.3449848Z 2025-05-07T20:26:16.3449853Z 2025-05-07T20:26:16.3449867Z 2025-05-07T20:26:16.3449872Z 2025-05-07T20:26:16.3449877Z 2025-05-07T20:26:16.3763635Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 9%  2025-05-07T20:26:16.4224872Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 67% 2025-05-07T20:26:16.4225228Z 2025-05-07T20:26:16.4225241Z 2025-05-07T20:26:16.4225246Z 2025-05-07T20:26:16.4225252Z 2025-05-07T20:26:16.4227695Z 2025-05-07T20:26:16.4394367Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:26:16.4394773Z 2025-05-07T20:26:16.4394779Z 2025-05-07T20:26:16.4394784Z 2025-05-07T20:26:16.4394789Z 2025-05-07T20:26:16.4394794Z 2025-05-07T20:26:16.4394800Z 2025-05-07T20:26:16.4451170Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 47%  2025-05-07T20:26:16.4451588Z 2025-05-07T20:26:16.4451612Z 2025-05-07T20:26:16.4451618Z 2025-05-07T20:26:16.4451623Z 2025-05-07T20:26:16.4451629Z 2025-05-07T20:26:16.4451634Z 2025-05-07T20:26:16.4451639Z 2025-05-07T20:26:16.4864718Z cuda-nvvp-12.8.57 | 112.4 MB | #1 | 11%  2025-05-07T20:26:16.5225416Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:26:16.5225778Z 2025-05-07T20:26:16.5225782Z 2025-05-07T20:26:16.5225786Z 2025-05-07T20:26:16.5225790Z 2025-05-07T20:26:16.5227272Z 2025-05-07T20:26:16.5451916Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 53%  2025-05-07T20:26:16.5452286Z 2025-05-07T20:26:16.5452319Z 2025-05-07T20:26:16.5452323Z 2025-05-07T20:26:16.5452327Z 2025-05-07T20:26:16.5452332Z 2025-05-07T20:26:16.5452336Z 2025-05-07T20:26:16.5457006Z 2025-05-07T20:26:16.5875671Z cuda-nvvp-12.8.57 | 112.4 MB | #3 | 14%  2025-05-07T20:26:16.5913006Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:26:16.5913698Z 2025-05-07T20:26:16.5913704Z 
2025-05-07T20:26:16.5913709Z 2025-05-07T20:26:16.5913715Z 2025-05-07T20:26:16.5913720Z 2025-05-07T20:26:16.5913726Z 2025-05-07T20:26:16.6228956Z cuda-nsight-12.8.55 | 113.2 MB | ####8 | 49%  2025-05-07T20:26:16.6229358Z 2025-05-07T20:26:16.6229363Z 2025-05-07T20:26:16.6229366Z 2025-05-07T20:26:16.6229370Z 2025-05-07T20:26:16.6231964Z 2025-05-07T20:26:16.6532062Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 55%  2025-05-07T20:26:16.6532421Z 2025-05-07T20:26:16.6532425Z 2025-05-07T20:26:16.6532428Z 2025-05-07T20:26:16.6532432Z 2025-05-07T20:26:16.6532436Z 2025-05-07T20:26:16.6532725Z 2025-05-07T20:26:16.6535589Z 2025-05-07T20:26:16.6919132Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 16%  2025-05-07T20:26:16.6919439Z 2025-05-07T20:26:16.6919444Z 2025-05-07T20:26:16.6919448Z 2025-05-07T20:26:16.6919451Z 2025-05-07T20:26:16.6919455Z 2025-05-07T20:26:16.6919466Z 2025-05-07T20:26:16.6971184Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 51%  2025-05-07T20:26:16.7246170Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 69% 2025-05-07T20:26:16.7246537Z 2025-05-07T20:26:16.7246543Z 2025-05-07T20:26:16.7246549Z 2025-05-07T20:26:16.7246554Z 2025-05-07T20:26:16.7250361Z 2025-05-07T20:26:16.7532998Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:26:16.7533379Z 2025-05-07T20:26:16.7533383Z 2025-05-07T20:26:16.7533387Z 2025-05-07T20:26:16.7533390Z 2025-05-07T20:26:16.7533394Z 2025-05-07T20:26:16.7533398Z 2025-05-07T20:26:16.7533401Z 2025-05-07T20:26:16.7924541Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 18%  2025-05-07T20:26:16.7924955Z 2025-05-07T20:26:16.7924961Z 2025-05-07T20:26:16.7924964Z 2025-05-07T20:26:16.7924968Z 2025-05-07T20:26:16.7924973Z 2025-05-07T20:26:16.7924977Z 2025-05-07T20:26:16.8001294Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 53%  2025-05-07T20:26:16.8288739Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 70% 2025-05-07T20:26:16.8288999Z 2025-05-07T20:26:16.8289097Z 2025-05-07T20:26:16.8289105Z 2025-05-07T20:26:16.8289127Z 2025-05-07T20:26:16.8293722Z 2025-05-07T20:26:16.8540540Z libnpp-12.3.3.65 | 130.6 MB | #####8 | 59%  2025-05-07T20:26:16.8540882Z 2025-05-07T20:26:16.8540888Z 2025-05-07T20:26:16.8540894Z 2025-05-07T20:26:16.8540910Z 2025-05-07T20:26:16.8540915Z 2025-05-07T20:26:16.8540920Z 2025-05-07T20:26:16.8540925Z 2025-05-07T20:26:16.8929481Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 21%  2025-05-07T20:26:16.8929797Z 2025-05-07T20:26:16.8929830Z 2025-05-07T20:26:16.8929834Z 2025-05-07T20:26:16.8929838Z 2025-05-07T20:26:16.8929842Z 2025-05-07T20:26:16.8934965Z 2025-05-07T20:26:16.9008215Z cuda-nsight-12.8.55 | 113.2 MB | #####5 | 56%  2025-05-07T20:26:16.9290592Z libcublas-12.8.3.14 | 460.2 MB | ####### | 70% 2025-05-07T20:26:16.9290932Z 2025-05-07T20:26:16.9290963Z 2025-05-07T20:26:16.9290967Z 2025-05-07T20:26:16.9290971Z 2025-05-07T20:26:16.9295150Z 2025-05-07T20:26:16.9545108Z libnpp-12.3.3.65 | 130.6 MB | ###### | 61%  2025-05-07T20:26:16.9545514Z 2025-05-07T20:26:16.9545519Z 2025-05-07T20:26:16.9545523Z 2025-05-07T20:26:16.9545526Z 2025-05-07T20:26:16.9545530Z 2025-05-07T20:26:16.9545535Z 2025-05-07T20:26:16.9546257Z 2025-05-07T20:26:16.9929816Z cuda-nvvp-12.8.57 | 112.4 MB | ##3 | 23%  2025-05-07T20:26:16.9930275Z 2025-05-07T20:26:16.9930283Z 2025-05-07T20:26:16.9930289Z 2025-05-07T20:26:16.9930294Z 2025-05-07T20:26:16.9930332Z 2025-05-07T20:26:16.9934625Z 2025-05-07T20:26:17.0009897Z cuda-nsight-12.8.55 | 113.2 MB | #####8 | 58%  2025-05-07T20:26:17.0292278Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:26:17.0292609Z 2025-05-07T20:26:17.0292613Z 2025-05-07T20:26:17.0292617Z 
2025-05-07T20:26:17.0292621Z 2025-05-07T20:26:17.0294335Z 2025-05-07T20:26:17.0546574Z libnpp-12.3.3.65 | 130.6 MB | ######2 | 63%  2025-05-07T20:26:17.0546883Z 2025-05-07T20:26:17.0546888Z 2025-05-07T20:26:17.0546892Z 2025-05-07T20:26:17.0546895Z 2025-05-07T20:26:17.0546899Z 2025-05-07T20:26:17.0546903Z 2025-05-07T20:26:17.0548564Z 2025-05-07T20:26:17.1012431Z cuda-nvvp-12.8.57 | 112.4 MB | ##5 | 26%  2025-05-07T20:26:17.1012800Z 2025-05-07T20:26:17.1012805Z 2025-05-07T20:26:17.1012808Z 2025-05-07T20:26:17.1012812Z 2025-05-07T20:26:17.1012816Z 2025-05-07T20:26:17.1012827Z 2025-05-07T20:26:17.1211839Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 61%  2025-05-07T20:26:17.1296610Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 72% 2025-05-07T20:26:17.1296866Z 2025-05-07T20:26:17.1296878Z 2025-05-07T20:26:17.1296882Z 2025-05-07T20:26:17.1296886Z 2025-05-07T20:26:17.1300122Z 2025-05-07T20:26:17.1670332Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 65%  2025-05-07T20:26:17.1670677Z 2025-05-07T20:26:17.1670682Z 2025-05-07T20:26:17.1670686Z 2025-05-07T20:26:17.1670690Z 2025-05-07T20:26:17.1670694Z 2025-05-07T20:26:17.1670697Z 2025-05-07T20:26:17.1670701Z 2025-05-07T20:26:17.2020302Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 28%  2025-05-07T20:26:17.2020627Z 2025-05-07T20:26:17.2020631Z 2025-05-07T20:26:17.2020635Z 2025-05-07T20:26:17.2020639Z 2025-05-07T20:26:17.2020643Z 2025-05-07T20:26:17.2020647Z 2025-05-07T20:26:17.2297607Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 63%  2025-05-07T20:26:17.2297933Z 2025-05-07T20:26:17.2297967Z 2025-05-07T20:26:17.2297972Z 2025-05-07T20:26:17.2297976Z 2025-05-07T20:26:17.2300148Z 2025-05-07T20:26:17.2397399Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 67%  2025-05-07T20:26:17.2819958Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:26:17.2820242Z 2025-05-07T20:26:17.2820247Z 2025-05-07T20:26:17.2820284Z 2025-05-07T20:26:17.2820288Z 2025-05-07T20:26:17.2820292Z 2025-05-07T20:26:17.2820296Z 2025-05-07T20:26:17.2821829Z 2025-05-07T20:26:17.3053446Z cuda-nvvp-12.8.57 | 112.4 MB | ### | 30%  2025-05-07T20:26:17.3053792Z 2025-05-07T20:26:17.3053798Z 2025-05-07T20:26:17.3053803Z 2025-05-07T20:26:17.3053809Z 2025-05-07T20:26:17.3053814Z 2025-05-07T20:26:17.3053819Z 2025-05-07T20:26:17.3303114Z cuda-nsight-12.8.55 | 113.2 MB | ######5 | 65%  2025-05-07T20:26:17.3303458Z 2025-05-07T20:26:17.3303462Z 2025-05-07T20:26:17.3303466Z 2025-05-07T20:26:17.3303470Z 2025-05-07T20:26:17.3304283Z 2025-05-07T20:26:17.3442924Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 69%  2025-05-07T20:26:17.3983190Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 2025-05-07T20:26:17.3983486Z 2025-05-07T20:26:17.3983490Z 2025-05-07T20:26:17.3983494Z 2025-05-07T20:26:17.3983498Z 2025-05-07T20:26:17.3983502Z 2025-05-07T20:26:17.3983506Z 2025-05-07T20:26:17.3983542Z 2025-05-07T20:26:17.4095846Z cuda-nvvp-12.8.57 | 112.4 MB | ###2 | 32%  2025-05-07T20:26:17.4096278Z 2025-05-07T20:26:17.4096284Z 2025-05-07T20:26:17.4096290Z 2025-05-07T20:26:17.4096296Z 2025-05-07T20:26:17.4096301Z 2025-05-07T20:26:17.4098087Z 2025-05-07T20:26:17.4331324Z cuda-nsight-12.8.55 | 113.2 MB | ######7 | 67%  2025-05-07T20:26:17.4331636Z 2025-05-07T20:26:17.4331640Z 2025-05-07T20:26:17.4331644Z 2025-05-07T20:26:17.4331647Z 2025-05-07T20:26:17.4332483Z 2025-05-07T20:26:17.4479292Z libnpp-12.3.3.65 | 130.6 MB | #######1 | 71%  2025-05-07T20:26:17.4987639Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 73% 2025-05-07T20:26:17.4987989Z 2025-05-07T20:26:17.4987994Z 2025-05-07T20:26:17.4987997Z 2025-05-07T20:26:17.4988001Z 
2025-05-07T20:26:17.4988005Z 2025-05-07T20:26:17.4988009Z 2025-05-07T20:26:17.5001568Z 2025-05-07T20:26:17.5222185Z cuda-nvvp-12.8.57 | 112.4 MB | ###4 | 35%  2025-05-07T20:26:17.5222782Z 2025-05-07T20:26:17.5222787Z 2025-05-07T20:26:17.5222791Z 2025-05-07T20:26:17.5222795Z 2025-05-07T20:26:17.5222798Z 2025-05-07T20:26:17.5223830Z 2025-05-07T20:26:17.5447274Z cuda-nsight-12.8.55 | 113.2 MB | ######9 | 69%  2025-05-07T20:26:17.5447562Z 2025-05-07T20:26:17.5447896Z 2025-05-07T20:26:17.5447908Z 2025-05-07T20:26:17.5447914Z 2025-05-07T20:26:17.5451643Z 2025-05-07T20:26:17.5529607Z libnpp-12.3.3.65 | 130.6 MB | #######3 | 73%  2025-05-07T20:26:17.5989641Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:26:17.5990208Z 2025-05-07T20:26:17.5990230Z 2025-05-07T20:26:17.5990235Z 2025-05-07T20:26:17.5990241Z 2025-05-07T20:26:17.5990246Z 2025-05-07T20:26:17.5990252Z 2025-05-07T20:26:17.5990258Z 2025-05-07T20:26:17.6296248Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 37%  2025-05-07T20:26:17.6296611Z 2025-05-07T20:26:17.6296616Z 2025-05-07T20:26:17.6296656Z 2025-05-07T20:26:17.6296662Z 2025-05-07T20:26:17.6296667Z 2025-05-07T20:26:17.6300126Z 2025-05-07T20:26:17.6449186Z cuda-nsight-12.8.55 | 113.2 MB | #######1 | 72%  2025-05-07T20:26:17.6449609Z 2025-05-07T20:26:17.6449613Z 2025-05-07T20:26:17.6449617Z 2025-05-07T20:26:17.6449621Z 2025-05-07T20:26:17.6450951Z 2025-05-07T20:26:17.6534041Z libnpp-12.3.3.65 | 130.6 MB | #######5 | 76%  2025-05-07T20:26:17.7076312Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:17.7076654Z 2025-05-07T20:26:17.7076659Z 2025-05-07T20:26:17.7076672Z 2025-05-07T20:26:17.7076707Z 2025-05-07T20:26:17.7076711Z 2025-05-07T20:26:17.7076714Z 2025-05-07T20:26:17.7076718Z 2025-05-07T20:26:17.7297543Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 39%  2025-05-07T20:26:17.7297862Z 2025-05-07T20:26:17.7297869Z 2025-05-07T20:26:17.7297873Z 2025-05-07T20:26:17.7297877Z 2025-05-07T20:26:17.7297880Z 2025-05-07T20:26:17.7298372Z 2025-05-07T20:26:17.7453707Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 74%  2025-05-07T20:26:17.7454081Z 2025-05-07T20:26:17.7454087Z 2025-05-07T20:26:17.7454092Z 2025-05-07T20:26:17.7454098Z 2025-05-07T20:26:17.7458067Z 2025-05-07T20:26:17.7540609Z libnpp-12.3.3.65 | 130.6 MB | #######7 | 78%  2025-05-07T20:26:17.8126989Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:26:17.8127281Z 2025-05-07T20:26:17.8127286Z 2025-05-07T20:26:17.8127290Z 2025-05-07T20:26:17.8127302Z 2025-05-07T20:26:17.8127306Z 2025-05-07T20:26:17.8127309Z 2025-05-07T20:26:17.8127727Z 2025-05-07T20:26:17.8349562Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 41%  2025-05-07T20:26:17.8349921Z 2025-05-07T20:26:17.8349925Z 2025-05-07T20:26:17.8349929Z 2025-05-07T20:26:17.8349933Z 2025-05-07T20:26:17.8349937Z 2025-05-07T20:26:17.8356291Z 2025-05-07T20:26:17.8454913Z cuda-nsight-12.8.55 | 113.2 MB | #######5 | 76%  2025-05-07T20:26:17.8455403Z 2025-05-07T20:26:17.8455410Z 2025-05-07T20:26:17.8455416Z 2025-05-07T20:26:17.8455421Z 2025-05-07T20:26:17.8455427Z 2025-05-07T20:26:17.8586131Z libnpp-12.3.3.65 | 130.6 MB | ######## | 80%  2025-05-07T20:26:17.9128577Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:26:17.9128866Z 2025-05-07T20:26:17.9128878Z 2025-05-07T20:26:17.9128882Z 2025-05-07T20:26:17.9128886Z 2025-05-07T20:26:17.9128890Z 2025-05-07T20:26:17.9128894Z 2025-05-07T20:26:17.9131199Z 2025-05-07T20:26:17.9389007Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 44%  2025-05-07T20:26:17.9389320Z 2025-05-07T20:26:17.9389325Z 2025-05-07T20:26:17.9389329Z 
2025-05-07T20:26:17.9389333Z 2025-05-07T20:26:17.9389336Z 2025-05-07T20:26:17.9409879Z 2025-05-07T20:26:17.9456712Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 78%  2025-05-07T20:26:17.9457071Z 2025-05-07T20:26:17.9457075Z 2025-05-07T20:26:17.9457408Z 2025-05-07T20:26:17.9457412Z 2025-05-07T20:26:17.9457416Z 2025-05-07T20:26:17.9637767Z libnpp-12.3.3.65 | 130.6 MB | ########2 | 82%  2025-05-07T20:26:18.0131827Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 76% 2025-05-07T20:26:18.0132119Z 2025-05-07T20:26:18.0132123Z 2025-05-07T20:26:18.0132135Z 2025-05-07T20:26:18.0132139Z 2025-05-07T20:26:18.0132143Z 2025-05-07T20:26:18.0132146Z 2025-05-07T20:26:18.0132150Z 2025-05-07T20:26:18.0394441Z cuda-nvvp-12.8.57 | 112.4 MB | ####5 | 46%  2025-05-07T20:26:18.0394760Z 2025-05-07T20:26:18.0394764Z 2025-05-07T20:26:18.0395038Z 2025-05-07T20:26:18.0395043Z 2025-05-07T20:26:18.0395047Z 2025-05-07T20:26:18.0395579Z 2025-05-07T20:26:18.0463205Z cuda-nsight-12.8.55 | 113.2 MB | ######## | 80%  2025-05-07T20:26:18.0463507Z 2025-05-07T20:26:18.0463511Z 2025-05-07T20:26:18.0463515Z 2025-05-07T20:26:18.0463519Z 2025-05-07T20:26:18.0465357Z 2025-05-07T20:26:18.0741935Z libnpp-12.3.3.65 | 130.6 MB | ########4 | 85%  2025-05-07T20:26:18.1265181Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:18.1265590Z 2025-05-07T20:26:18.1265597Z 2025-05-07T20:26:18.1265602Z 2025-05-07T20:26:18.1265608Z 2025-05-07T20:26:18.1265624Z 2025-05-07T20:26:18.1265629Z 2025-05-07T20:26:18.1265635Z 2025-05-07T20:26:18.1394789Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 48%  2025-05-07T20:26:18.1395108Z 2025-05-07T20:26:18.1395112Z 2025-05-07T20:26:18.1395123Z 2025-05-07T20:26:18.1395127Z 2025-05-07T20:26:18.1395131Z 2025-05-07T20:26:18.1397955Z 2025-05-07T20:26:18.1595134Z cuda-nsight-12.8.55 | 113.2 MB | ########2 | 82%  2025-05-07T20:26:18.1595483Z 2025-05-07T20:26:18.1595489Z 2025-05-07T20:26:18.1595494Z 2025-05-07T20:26:18.1595499Z 2025-05-07T20:26:18.1597499Z 2025-05-07T20:26:18.1745213Z libnpp-12.3.3.65 | 130.6 MB | ########6 | 87%  2025-05-07T20:26:18.2380118Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:18.2380436Z 2025-05-07T20:26:18.2380444Z 2025-05-07T20:26:18.2380464Z 2025-05-07T20:26:18.2380470Z 2025-05-07T20:26:18.2380478Z 2025-05-07T20:26:18.2380485Z 2025-05-07T20:26:18.2380518Z 2025-05-07T20:26:18.2448107Z cuda-nvvp-12.8.57 | 112.4 MB | ##### | 50%  2025-05-07T20:26:18.2448415Z 2025-05-07T20:26:18.2448419Z 2025-05-07T20:26:18.2448951Z 2025-05-07T20:26:18.2448966Z 2025-05-07T20:26:18.2448976Z 2025-05-07T20:26:18.2453905Z 2025-05-07T20:26:18.2595205Z cuda-nsight-12.8.55 | 113.2 MB | ########4 | 84%  2025-05-07T20:26:18.2595677Z 2025-05-07T20:26:18.2595683Z 2025-05-07T20:26:18.2595688Z 2025-05-07T20:26:18.2595693Z 2025-05-07T20:26:18.2597744Z 2025-05-07T20:26:18.2746678Z libnpp-12.3.3.65 | 130.6 MB | ########9 | 89%  2025-05-07T20:26:18.3381128Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:18.3381399Z 2025-05-07T20:26:18.3381441Z 2025-05-07T20:26:18.3381447Z 2025-05-07T20:26:18.3381451Z 2025-05-07T20:26:18.3381457Z 2025-05-07T20:26:18.3381460Z 2025-05-07T20:26:18.3381466Z 2025-05-07T20:26:18.3451363Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 52%  2025-05-07T20:26:18.3451671Z 2025-05-07T20:26:18.3451675Z 2025-05-07T20:26:18.3451679Z 2025-05-07T20:26:18.3451682Z 2025-05-07T20:26:18.3451687Z 2025-05-07T20:26:18.3451691Z 2025-05-07T20:26:18.3669746Z cuda-nsight-12.8.55 | 113.2 MB | ########6 | 87%  2025-05-07T20:26:18.3670167Z 2025-05-07T20:26:18.3670173Z 
2025-05-07T20:26:22.7622664Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:26:22.9629692Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:26:23.2448031Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:26:24.3079869Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:26:26.1600670Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:26:26.2108743Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:26:26.5728386Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:26:27.1027332Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:26:27.6025542Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:26:27.6484385Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:26:28.3799738Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:28.5621983Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:26:28.9645039Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:26:29.1956986Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:26:29.5313486Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:26:29.5609664Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:26:29.5653329Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:26:29.6654732Z ... (more hidden) ...
2025-05-07T20:26:29.8284837Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:26:32.1996756Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:41.1160764Z 2025-05-07T20:26:41.1160770Z 2025-05-07T20:26:41.1160779Z 2025-05-07T20:26:41.1160784Z 2025-05-07T20:26:41.1160790Z 2025-05-07T20:26:41.1160795Z 2025-05-07T20:26:41.1161164Z  2025-05-07T20:26:41.1161366Z 2025-05-07T20:26:41.1161379Z 2025-05-07T20:26:41.1161385Z 2025-05-07T20:26:41.1161389Z 2025-05-07T20:26:41.1161393Z 2025-05-07T20:26:41.1161396Z 2025-05-07T20:26:41.1161400Z 2025-05-07T20:26:41.1161404Z 2025-05-07T20:26:41.1161872Z  2025-05-07T20:26:41.1162088Z 2025-05-07T20:26:41.1162094Z 2025-05-07T20:26:41.1162107Z 2025-05-07T20:26:41.1162113Z 2025-05-07T20:26:41.1162129Z 2025-05-07T20:26:41.1162135Z 2025-05-07T20:26:41.1162140Z 2025-05-07T20:26:41.1162146Z 2025-05-07T20:26:41.1162160Z 2025-05-07T20:26:41.1162537Z  2025-05-07T20:26:41.1162756Z 2025-05-07T20:26:41.1162761Z 2025-05-07T20:26:41.1162764Z 2025-05-07T20:26:41.1162774Z 2025-05-07T20:26:41.1162778Z 2025-05-07T20:26:41.1162782Z 2025-05-07T20:26:41.1162902Z 2025-05-07T20:26:41.1162906Z 2025-05-07T20:26:41.1162909Z 2025-05-07T20:26:41.1162913Z 2025-05-07T20:26:41.1163172Z  2025-05-07T20:26:41.1163397Z 2025-05-07T20:26:41.1163407Z 2025-05-07T20:26:41.1163410Z 2025-05-07T20:26:41.1163414Z 2025-05-07T20:26:41.1163418Z 2025-05-07T20:26:41.1163421Z 2025-05-07T20:26:41.1163425Z 2025-05-07T20:26:41.1163429Z 2025-05-07T20:26:41.1163432Z 2025-05-07T20:26:41.1163436Z 2025-05-07T20:26:41.1163440Z 2025-05-07T20:26:41.1163846Z  2025-05-07T20:26:41.1164088Z 2025-05-07T20:26:41.1164094Z 2025-05-07T20:26:41.1164237Z 2025-05-07T20:26:41.1164244Z 2025-05-07T20:26:41.1164266Z 2025-05-07T20:26:41.1164272Z 2025-05-07T20:26:41.1164277Z 2025-05-07T20:26:41.1164282Z 2025-05-07T20:26:41.1164288Z 2025-05-07T20:26:41.1164293Z 2025-05-07T20:26:41.1164298Z 2025-05-07T20:26:41.1164303Z 2025-05-07T20:26:41.1164521Z  2025-05-07T20:26:41.1164774Z 2025-05-07T20:26:41.1164785Z 2025-05-07T20:26:41.1164791Z 2025-05-07T20:26:41.1164796Z 2025-05-07T20:26:41.1164801Z 2025-05-07T20:26:41.1164806Z 2025-05-07T20:26:41.1164811Z 2025-05-07T20:26:41.1164817Z 2025-05-07T20:26:41.1164822Z 2025-05-07T20:26:41.1164827Z 2025-05-07T20:26:41.1164832Z 2025-05-07T20:26:41.1164837Z 2025-05-07T20:26:41.1164842Z 2025-05-07T20:26:41.1165217Z  2025-05-07T20:26:41.1165471Z 2025-05-07T20:26:41.1165484Z 2025-05-07T20:26:41.1165489Z 2025-05-07T20:26:41.1165494Z 2025-05-07T20:26:41.1165499Z 2025-05-07T20:26:41.1165504Z 2025-05-07T20:26:41.1165510Z 2025-05-07T20:26:41.1165523Z 2025-05-07T20:26:41.1165528Z 2025-05-07T20:26:41.1165533Z 2025-05-07T20:26:41.1165539Z 2025-05-07T20:26:41.1165544Z 2025-05-07T20:26:41.1165558Z 2025-05-07T20:26:41.1165563Z 2025-05-07T20:26:41.1165833Z  2025-05-07T20:26:41.1166028Z 2025-05-07T20:26:41.1166032Z 2025-05-07T20:26:41.1166036Z 2025-05-07T20:26:41.1166057Z 2025-05-07T20:26:41.1166061Z 2025-05-07T20:26:41.1166064Z 2025-05-07T20:26:41.1166068Z 2025-05-07T20:26:41.1166072Z 2025-05-07T20:26:41.1166086Z 2025-05-07T20:26:41.1166090Z 2025-05-07T20:26:41.1166094Z 2025-05-07T20:26:41.1166097Z 2025-05-07T20:26:41.1166101Z 2025-05-07T20:26:41.1166105Z 2025-05-07T20:26:41.1166108Z 2025-05-07T20:26:41.1166480Z  2025-05-07T20:26:41.1166747Z 2025-05-07T20:26:41.1166759Z 2025-05-07T20:26:41.1166764Z 2025-05-07T20:26:41.1166776Z 2025-05-07T20:26:41.1166782Z 2025-05-07T20:26:41.1166787Z 2025-05-07T20:26:41.1166793Z 2025-05-07T20:26:41.1166804Z 2025-05-07T20:26:41.1166810Z 2025-05-07T20:26:41.1166815Z 2025-05-07T20:26:41.1166821Z 2025-05-07T20:26:41.1166826Z 2025-05-07T20:26:41.1166831Z 2025-05-07T20:26:41.1166837Z 2025-05-07T20:26:41.1166842Z 
2025-05-07T20:26:41.1166847Z 2025-05-07T20:26:41.1167095Z  2025-05-07T20:26:41.1167338Z 2025-05-07T20:26:41.1167354Z 2025-05-07T20:26:41.1167358Z 2025-05-07T20:26:41.1167362Z 2025-05-07T20:26:41.1167365Z 2025-05-07T20:26:41.1167369Z 2025-05-07T20:26:41.1167373Z 2025-05-07T20:26:41.1167376Z 2025-05-07T20:26:41.1167380Z 2025-05-07T20:26:41.1167384Z 2025-05-07T20:26:41.1167387Z 2025-05-07T20:26:41.1167391Z 2025-05-07T20:26:41.1167394Z 2025-05-07T20:26:41.1167398Z 2025-05-07T20:26:41.1167402Z 2025-05-07T20:26:41.1167405Z 2025-05-07T20:26:41.1167409Z 2025-05-07T20:26:41.1167829Z  2025-05-07T20:26:41.1168124Z 2025-05-07T20:26:41.1168129Z 2025-05-07T20:26:41.1168135Z 2025-05-07T20:26:41.1168157Z 2025-05-07T20:26:41.1168163Z 2025-05-07T20:26:41.1168169Z 2025-05-07T20:26:41.1168174Z 2025-05-07T20:26:41.1168180Z 2025-05-07T20:26:41.1168185Z 2025-05-07T20:26:41.1168191Z 2025-05-07T20:26:41.1168196Z 2025-05-07T20:26:41.1168202Z 2025-05-07T20:26:41.1168207Z 2025-05-07T20:26:41.1168212Z 2025-05-07T20:26:41.1168217Z 2025-05-07T20:26:41.1168756Z 2025-05-07T20:26:41.1168760Z 2025-05-07T20:26:41.1168764Z 2025-05-07T20:26:41.1169050Z  2025-05-07T20:26:41.1169348Z 2025-05-07T20:26:41.1169354Z 2025-05-07T20:26:41.1169630Z  2025-05-07T20:26:41.1169781Z 2025-05-07T20:26:41.1169790Z 2025-05-07T20:26:41.1170245Z  2025-05-07T20:26:41.1170412Z 2025-05-07T20:26:41.1170420Z 2025-05-07T20:26:41.1170431Z 2025-05-07T20:26:41.1170946Z  2025-05-07T20:26:41.1171102Z 2025-05-07T20:26:41.1171107Z 2025-05-07T20:26:41.1171116Z 2025-05-07T20:26:41.1171122Z 2025-05-07T20:26:41.1171590Z  2025-05-07T20:26:41.1171887Z 2025-05-07T20:26:41.1171902Z 2025-05-07T20:26:41.1171906Z 2025-05-07T20:26:41.1171910Z 2025-05-07T20:26:41.1171913Z 2025-05-07T20:26:41.1172291Z  2025-05-07T20:26:41.1172465Z 2025-05-07T20:26:41.1172471Z 2025-05-07T20:26:41.1172477Z 2025-05-07T20:26:41.1172486Z 2025-05-07T20:26:41.1172491Z 2025-05-07T20:26:41.1172496Z 2025-05-07T20:26:41.1172942Z  2025-05-07T20:26:41.1173126Z 2025-05-07T20:26:41.1173132Z 2025-05-07T20:26:41.1173144Z 2025-05-07T20:26:41.1173149Z 2025-05-07T20:26:41.1173155Z 2025-05-07T20:26:41.1173160Z 2025-05-07T20:26:41.1173166Z 2025-05-07T20:26:41.1173597Z  2025-05-07T20:26:41.1173790Z 2025-05-07T20:26:41.1173800Z 2025-05-07T20:26:41.1173806Z 2025-05-07T20:26:41.1173811Z 2025-05-07T20:26:41.1173816Z 2025-05-07T20:26:41.1173821Z 2025-05-07T20:26:41.1173826Z 2025-05-07T20:26:41.1173832Z 2025-05-07T20:26:41.1174249Z  2025-05-07T20:26:41.1174456Z 2025-05-07T20:26:41.1174466Z 2025-05-07T20:26:41.1174481Z 2025-05-07T20:26:41.1174487Z 2025-05-07T20:26:41.1174492Z 2025-05-07T20:26:41.1174497Z 2025-05-07T20:26:41.1174502Z 2025-05-07T20:26:41.1174508Z 2025-05-07T20:26:41.1174513Z 2025-05-07T20:26:41.1174952Z  2025-05-07T20:26:41.1175170Z 2025-05-07T20:26:41.1175174Z 2025-05-07T20:26:41.1175178Z 2025-05-07T20:26:41.1175194Z 2025-05-07T20:26:41.1175198Z 2025-05-07T20:26:41.1175201Z 2025-05-07T20:26:41.1175205Z 2025-05-07T20:26:41.1175209Z 2025-05-07T20:26:41.1175212Z 2025-05-07T20:26:41.1175216Z 2025-05-07T20:26:41.1175520Z  2025-05-07T20:26:41.1175725Z 2025-05-07T20:26:41.1175734Z 2025-05-07T20:26:41.1175738Z 2025-05-07T20:26:41.1175742Z 2025-05-07T20:26:41.1175746Z 2025-05-07T20:26:41.1175749Z 2025-05-07T20:26:41.1175753Z 2025-05-07T20:26:41.1175757Z 2025-05-07T20:26:41.1175760Z 2025-05-07T20:26:41.1175771Z 2025-05-07T20:26:41.1175774Z 2025-05-07T20:26:41.1176247Z  2025-05-07T20:26:41.1176499Z 2025-05-07T20:26:41.1176505Z 2025-05-07T20:26:41.1176510Z 2025-05-07T20:26:41.1176516Z 2025-05-07T20:26:41.1176521Z 
2025-05-07T20:26:41.1176532Z 2025-05-07T20:26:41.1176537Z 2025-05-07T20:26:41.1176542Z 2025-05-07T20:26:41.1176547Z 2025-05-07T20:26:41.1176561Z 2025-05-07T20:26:41.1176567Z 2025-05-07T20:26:41.1176572Z 2025-05-07T20:26:41.1176792Z  2025-05-07T20:26:41.1177013Z 2025-05-07T20:26:41.1177023Z 2025-05-07T20:26:41.1177027Z 2025-05-07T20:26:41.1177037Z 2025-05-07T20:26:41.1177041Z 2025-05-07T20:26:41.1177045Z 2025-05-07T20:26:41.1177049Z 2025-05-07T20:26:41.1177052Z 2025-05-07T20:26:41.1177056Z 2025-05-07T20:26:41.1177060Z 2025-05-07T20:26:41.1177063Z 2025-05-07T20:26:41.1177067Z 2025-05-07T20:26:41.1177071Z 2025-05-07T20:26:41.1177492Z  2025-05-07T20:26:41.1177760Z 2025-05-07T20:26:41.1177766Z 2025-05-07T20:26:41.1177772Z 2025-05-07T20:26:41.1177785Z 2025-05-07T20:26:41.1177791Z 2025-05-07T20:26:41.1177803Z 2025-05-07T20:26:41.1177809Z 2025-05-07T20:26:41.1177815Z 2025-05-07T20:26:41.1177820Z 2025-05-07T20:26:41.1177826Z 2025-05-07T20:26:41.1177831Z 2025-05-07T20:26:41.1177837Z 2025-05-07T20:26:41.1177842Z 2025-05-07T20:26:41.1177847Z 2025-05-07T20:26:41.1178138Z  2025-05-07T20:26:41.1178402Z 2025-05-07T20:26:41.1178533Z 2025-05-07T20:26:41.1178538Z 2025-05-07T20:26:41.1178561Z 2025-05-07T20:26:41.1178567Z 2025-05-07T20:26:41.1178572Z 2025-05-07T20:26:41.1178577Z 2025-05-07T20:26:41.1178583Z 2025-05-07T20:26:41.1178588Z 2025-05-07T20:26:41.1178594Z 2025-05-07T20:26:41.1178599Z 2025-05-07T20:26:41.1178604Z 2025-05-07T20:26:41.1178609Z 2025-05-07T20:26:41.1178614Z 2025-05-07T20:26:41.1178619Z 2025-05-07T20:26:41.1178824Z  2025-05-07T20:26:41.1179106Z 2025-05-07T20:26:41.1179111Z 2025-05-07T20:26:41.1179116Z 2025-05-07T20:26:41.1179121Z 2025-05-07T20:26:41.1179126Z 2025-05-07T20:26:41.1179218Z 2025-05-07T20:26:41.1179224Z 2025-05-07T20:26:41.1179230Z 2025-05-07T20:26:41.1179234Z 2025-05-07T20:26:41.1179239Z 2025-05-07T20:26:41.1179244Z 2025-05-07T20:26:41.1179250Z 2025-05-07T20:26:41.1179255Z 2025-05-07T20:26:41.1179260Z 2025-05-07T20:26:41.1179266Z 2025-05-07T20:26:41.1179271Z 2025-05-07T20:26:41.1179505Z  2025-05-07T20:26:41.1179785Z 2025-05-07T20:26:41.1179790Z 2025-05-07T20:26:41.1179795Z 2025-05-07T20:26:41.1179801Z 2025-05-07T20:26:41.1179806Z 2025-05-07T20:26:41.1179811Z 2025-05-07T20:26:41.1179826Z 2025-05-07T20:26:41.1179832Z 2025-05-07T20:26:41.1179837Z 2025-05-07T20:26:41.1179842Z 2025-05-07T20:26:41.1179847Z 2025-05-07T20:26:41.1179853Z 2025-05-07T20:26:41.1179858Z 2025-05-07T20:26:41.1179864Z 2025-05-07T20:26:41.1179870Z 2025-05-07T20:26:41.1179875Z 2025-05-07T20:26:41.1179890Z 2025-05-07T20:26:41.1180102Z  2025-05-07T20:26:41.1180393Z 2025-05-07T20:26:41.1180405Z 2025-05-07T20:26:41.1180411Z 2025-05-07T20:26:41.1180416Z 2025-05-07T20:26:41.1180421Z 2025-05-07T20:26:41.1180426Z 2025-05-07T20:26:41.1180431Z 2025-05-07T20:26:41.1180436Z 2025-05-07T20:26:41.1180442Z 2025-05-07T20:26:41.1180447Z 2025-05-07T20:26:41.1180452Z 2025-05-07T20:26:41.1180457Z 2025-05-07T20:26:41.1180462Z 2025-05-07T20:26:41.1180473Z 2025-05-07T20:26:41.1180478Z 2025-05-07T20:26:41.1180483Z 2025-05-07T20:26:41.1180489Z 2025-05-07T20:26:41.1180506Z 2025-05-07T20:26:41.1181436Z  2025-05-07T20:26:41.1181738Z 2025-05-07T20:26:41.1181744Z 2025-05-07T20:26:41.1181889Z  2025-05-07T20:26:41.1182022Z 2025-05-07T20:26:41.1182027Z 2025-05-07T20:26:41.1182405Z  2025-05-07T20:26:41.1182545Z 2025-05-07T20:26:41.1182554Z 2025-05-07T20:26:41.1182559Z 2025-05-07T20:26:41.1183079Z  2025-05-07T20:26:41.1183233Z 2025-05-07T20:26:41.1183238Z 2025-05-07T20:26:41.1183244Z 2025-05-07T20:26:41.1183249Z 2025-05-07T20:26:41.1183598Z  
2025-05-07T20:26:41.1183775Z 2025-05-07T20:26:41.1183780Z 2025-05-07T20:26:41.1183785Z 2025-05-07T20:26:41.1183794Z 2025-05-07T20:26:41.1183799Z 2025-05-07T20:26:41.1184151Z  2025-05-07T20:26:41.1184329Z 2025-05-07T20:26:41.1184342Z 2025-05-07T20:26:41.1184347Z 2025-05-07T20:26:41.1184352Z 2025-05-07T20:26:41.1184364Z 2025-05-07T20:26:41.1184369Z 2025-05-07T20:26:41.1184881Z  2025-05-07T20:26:41.1185029Z 2025-05-07T20:26:41.1185034Z 2025-05-07T20:26:41.1185038Z 2025-05-07T20:26:41.1185041Z 2025-05-07T20:26:41.1185045Z 2025-05-07T20:26:41.1185049Z 2025-05-07T20:26:41.1185053Z 2025-05-07T20:26:41.1185336Z  2025-05-07T20:26:41.1185530Z 2025-05-07T20:26:41.1185536Z 2025-05-07T20:26:41.1185542Z 2025-05-07T20:26:41.1185547Z 2025-05-07T20:26:41.1185556Z 2025-05-07T20:26:41.1185561Z 2025-05-07T20:26:41.1185566Z 2025-05-07T20:26:41.1185572Z 2025-05-07T20:26:41.1185929Z  2025-05-07T20:26:41.1186147Z 2025-05-07T20:26:41.1186153Z 2025-05-07T20:26:41.1186158Z 2025-05-07T20:26:41.1186164Z 2025-05-07T20:26:41.1186169Z 2025-05-07T20:26:41.1186178Z 2025-05-07T20:26:41.1186183Z 2025-05-07T20:26:41.1186188Z 2025-05-07T20:26:41.1186193Z 2025-05-07T20:26:41.1186498Z  2025-05-07T20:26:41.1186711Z 2025-05-07T20:26:41.1186846Z 2025-05-07T20:26:41.1186852Z 2025-05-07T20:26:41.1186857Z 2025-05-07T20:26:41.1186862Z 2025-05-07T20:26:41.1186868Z 2025-05-07T20:26:41.1186873Z 2025-05-07T20:26:41.1186878Z 2025-05-07T20:26:41.1186884Z 2025-05-07T20:26:41.1186889Z 2025-05-07T20:26:41.1187083Z  2025-05-07T20:26:41.1187298Z 2025-05-07T20:26:41.1187304Z 2025-05-07T20:26:41.1187309Z 2025-05-07T20:26:41.1187314Z 2025-05-07T20:26:41.1187319Z 2025-05-07T20:26:41.1187324Z 2025-05-07T20:26:41.1187330Z 2025-05-07T20:26:41.1187338Z 2025-05-07T20:26:41.1187343Z 2025-05-07T20:26:41.1187361Z 2025-05-07T20:26:41.1187366Z 2025-05-07T20:26:41.1187868Z  2025-05-07T20:26:41.1188087Z 2025-05-07T20:26:41.1188092Z 2025-05-07T20:26:41.1188095Z 2025-05-07T20:26:41.1188103Z 2025-05-07T20:26:41.1188203Z 2025-05-07T20:26:41.1188209Z 2025-05-07T20:26:41.1188212Z 2025-05-07T20:26:41.1188216Z 2025-05-07T20:26:41.1188220Z 2025-05-07T20:26:41.1188224Z 2025-05-07T20:26:41.1188239Z 2025-05-07T20:26:41.1188384Z 2025-05-07T20:26:41.1188645Z  2025-05-07T20:26:41.1188900Z 2025-05-07T20:26:41.1188906Z 2025-05-07T20:26:41.1188911Z 2025-05-07T20:26:41.1188917Z 2025-05-07T20:26:41.1188931Z 2025-05-07T20:26:41.1188937Z 2025-05-07T20:26:41.1188950Z 2025-05-07T20:26:41.1188955Z 2025-05-07T20:26:41.1188960Z 2025-05-07T20:26:41.1188966Z 2025-05-07T20:26:41.1188971Z 2025-05-07T20:26:41.1188976Z 2025-05-07T20:26:41.1188982Z 2025-05-07T20:26:41.1189176Z  2025-05-07T20:26:41.1189431Z 2025-05-07T20:26:41.1189448Z 2025-05-07T20:26:41.1189464Z 2025-05-07T20:26:41.1189469Z 2025-05-07T20:26:41.1189475Z 2025-05-07T20:26:41.1189480Z 2025-05-07T20:26:41.1189485Z 2025-05-07T20:26:41.1189490Z 2025-05-07T20:26:41.1189495Z 2025-05-07T20:26:41.1189501Z 2025-05-07T20:26:41.1189506Z 2025-05-07T20:26:41.1189511Z 2025-05-07T20:26:41.1189516Z 2025-05-07T20:26:41.1189522Z 2025-05-07T20:26:41.1189719Z  2025-05-07T20:26:41.1189997Z 2025-05-07T20:26:41.1190003Z 2025-05-07T20:26:41.1190008Z 2025-05-07T20:26:41.1190013Z 2025-05-07T20:26:41.1190019Z 2025-05-07T20:26:41.1190024Z 2025-05-07T20:26:41.1190029Z 2025-05-07T20:26:41.1190035Z 2025-05-07T20:26:41.1190040Z 2025-05-07T20:26:41.1190045Z 2025-05-07T20:26:41.1190050Z 2025-05-07T20:26:41.1190054Z 2025-05-07T20:26:41.1190059Z 2025-05-07T20:26:41.1190064Z 2025-05-07T20:26:41.1190070Z 2025-05-07T20:26:41.1190293Z  2025-05-07T20:26:41.1190556Z 
2025-05-07T20:26:41.1190561Z 2025-05-07T20:26:41.1190565Z 2025-05-07T20:26:41.1190577Z 2025-05-07T20:26:41.1190582Z 2025-05-07T20:26:41.1190586Z 2025-05-07T20:26:41.1190598Z 2025-05-07T20:26:41.1190603Z 2025-05-07T20:26:41.1190607Z 2025-05-07T20:26:41.1190612Z 2025-05-07T20:26:41.1190617Z 2025-05-07T20:26:41.1190622Z 2025-05-07T20:26:41.1190626Z 2025-05-07T20:26:41.1190631Z 2025-05-07T20:26:41.1190636Z 2025-05-07T20:26:41.1190648Z 2025-05-07T20:26:41.1190880Z  2025-05-07T20:26:41.1191157Z 2025-05-07T20:26:41.1191163Z 2025-05-07T20:26:41.1191168Z 2025-05-07T20:26:41.1191173Z 2025-05-07T20:26:41.1191179Z 2025-05-07T20:26:41.1191193Z 2025-05-07T20:26:41.1191199Z 2025-05-07T20:26:41.1191204Z 2025-05-07T20:26:41.1191209Z 2025-05-07T20:26:41.1191214Z 2025-05-07T20:26:41.1191220Z 2025-05-07T20:26:41.1191225Z 2025-05-07T20:26:41.1191230Z 2025-05-07T20:26:41.1191235Z 2025-05-07T20:26:41.1191240Z 2025-05-07T20:26:41.1191246Z 2025-05-07T20:26:41.1191251Z 2025-05-07T20:26:41.1191475Z  2025-05-07T20:26:41.1191759Z 2025-05-07T20:26:41.1191765Z 2025-05-07T20:26:41.1191770Z 2025-05-07T20:26:41.1191775Z 2025-05-07T20:26:41.1191781Z 2025-05-07T20:26:41.1191786Z 2025-05-07T20:26:41.1191792Z 2025-05-07T20:26:41.1191797Z 2025-05-07T20:26:41.1191813Z 2025-05-07T20:26:41.1191818Z 2025-05-07T20:26:41.1191823Z 2025-05-07T20:26:41.1191992Z 2025-05-07T20:26:41.1191997Z 2025-05-07T20:26:41.1192002Z 2025-05-07T20:26:41.1192008Z 2025-05-07T20:26:41.1192013Z 2025-05-07T20:26:41.1192018Z 2025-05-07T20:26:41.1192023Z 2025-05-07T20:26:41.1192265Z  2025-05-07T20:26:41.1192558Z 2025-05-07T20:26:41.1192563Z 2025-05-07T20:26:41.1192699Z  2025-05-07T20:26:41.1192832Z 2025-05-07T20:26:41.1192838Z 2025-05-07T20:26:41.1192986Z  2025-05-07T20:26:41.1193123Z 2025-05-07T20:26:41.1193129Z 2025-05-07T20:26:41.1193134Z 2025-05-07T20:26:41.1193283Z  2025-05-07T20:26:41.1193442Z 2025-05-07T20:26:41.1193542Z 2025-05-07T20:26:41.1193548Z 2025-05-07T20:26:41.1193554Z 2025-05-07T20:26:41.1193706Z  2025-05-07T20:26:41.1193870Z 2025-05-07T20:26:41.1193875Z 2025-05-07T20:26:41.1193881Z 2025-05-07T20:26:41.1193886Z 2025-05-07T20:26:41.1193891Z 2025-05-07T20:26:41.1194041Z  2025-05-07T20:26:41.1194216Z 2025-05-07T20:26:41.1194222Z 2025-05-07T20:26:41.1194235Z 2025-05-07T20:26:41.1194240Z 2025-05-07T20:26:41.1194246Z 2025-05-07T20:26:41.1194251Z 2025-05-07T20:26:41.1194412Z  2025-05-07T20:26:41.1194595Z 2025-05-07T20:26:41.1194600Z 2025-05-07T20:26:41.1194606Z 2025-05-07T20:26:41.1194611Z 2025-05-07T20:26:41.1194616Z 2025-05-07T20:26:41.1194621Z 2025-05-07T20:26:41.1194627Z 2025-05-07T20:26:41.1194796Z  2025-05-07T20:26:41.1194991Z 2025-05-07T20:26:41.1194996Z 2025-05-07T20:26:41.1195001Z 2025-05-07T20:26:41.1195007Z 2025-05-07T20:26:41.1195012Z 2025-05-07T20:26:41.1195017Z 2025-05-07T20:26:41.1195022Z 2025-05-07T20:26:41.1195036Z 2025-05-07T20:26:41.1195207Z  2025-05-07T20:26:41.1195414Z 2025-05-07T20:26:41.1195419Z 2025-05-07T20:26:41.1195425Z 2025-05-07T20:26:41.1195430Z 2025-05-07T20:26:41.1195435Z 2025-05-07T20:26:41.1195440Z 2025-05-07T20:26:41.1195445Z 2025-05-07T20:26:41.1195450Z 2025-05-07T20:26:41.1195456Z 2025-05-07T20:26:41.1195628Z  2025-05-07T20:26:41.1195850Z 2025-05-07T20:26:41.1195856Z 2025-05-07T20:26:41.1195861Z 2025-05-07T20:26:41.1195866Z 2025-05-07T20:26:41.1195871Z 2025-05-07T20:26:41.1195876Z 2025-05-07T20:26:41.1195881Z 2025-05-07T20:26:41.1195887Z 2025-05-07T20:26:41.1195892Z 2025-05-07T20:26:41.1195897Z 2025-05-07T20:26:41.1196088Z  2025-05-07T20:26:41.1196305Z 2025-05-07T20:26:41.1196311Z 2025-05-07T20:26:41.1196316Z 
2025-05-07T20:26:41.1196321Z 2025-05-07T20:26:41.1196326Z 2025-05-07T20:26:41.1196331Z 2025-05-07T20:26:41.1196337Z 2025-05-07T20:26:41.1196342Z 2025-05-07T20:26:41.1196347Z 2025-05-07T20:26:41.1196359Z 2025-05-07T20:26:41.1196364Z 2025-05-07T20:26:41.1196551Z  2025-05-07T20:26:41.1196783Z 2025-05-07T20:26:41.1196788Z 2025-05-07T20:26:41.1196793Z 2025-05-07T20:26:41.1196798Z 2025-05-07T20:26:41.1196803Z 2025-05-07T20:26:41.1196808Z 2025-05-07T20:26:41.1196814Z 2025-05-07T20:26:41.1196819Z 2025-05-07T20:26:41.1196837Z 2025-05-07T20:26:41.1196842Z 2025-05-07T20:26:41.1196847Z 2025-05-07T20:26:41.1196852Z 2025-05-07T20:26:41.1197034Z  2025-05-07T20:26:41.1197277Z 2025-05-07T20:26:41.1197282Z 2025-05-07T20:26:41.1197295Z 2025-05-07T20:26:41.1197300Z 2025-05-07T20:26:41.1197305Z 2025-05-07T20:26:41.1197310Z 2025-05-07T20:26:41.1197315Z 2025-05-07T20:26:41.1197319Z 2025-05-07T20:26:41.1197324Z 2025-05-07T20:26:41.1197329Z 2025-05-07T20:26:41.1197333Z 2025-05-07T20:26:41.1197338Z 2025-05-07T20:26:41.1197343Z 2025-05-07T20:26:41.1197525Z  2025-05-07T20:26:41.1197799Z 2025-05-07T20:26:41.1197806Z 2025-05-07T20:26:41.1197811Z 2025-05-07T20:26:41.1197816Z 2025-05-07T20:26:41.1197821Z 2025-05-07T20:26:41.1197826Z 2025-05-07T20:26:41.1197831Z 2025-05-07T20:26:41.1197836Z 2025-05-07T20:26:41.1197841Z 2025-05-07T20:26:41.1197846Z 2025-05-07T20:26:41.1197851Z 2025-05-07T20:26:41.1197856Z 2025-05-07T20:26:41.1197987Z 2025-05-07T20:26:41.1197992Z 2025-05-07T20:26:41.1198217Z  2025-05-07T20:26:41.1198411Z 2025-05-07T20:26:41.1198415Z 2025-05-07T20:26:41.1198418Z 2025-05-07T20:26:41.1198422Z 2025-05-07T20:26:41.1198426Z 2025-05-07T20:26:41.1198429Z 2025-05-07T20:26:41.1198433Z 2025-05-07T20:26:41.1198437Z 2025-05-07T20:26:41.1198440Z 2025-05-07T20:26:41.1198444Z 2025-05-07T20:26:41.1198456Z 2025-05-07T20:26:41.1198460Z 2025-05-07T20:26:41.1198464Z 2025-05-07T20:26:41.1198467Z 2025-05-07T20:26:41.1198471Z 2025-05-07T20:26:41.1198623Z  2025-05-07T20:26:41.1198907Z 2025-05-07T20:26:41.1198912Z 2025-05-07T20:26:41.1198915Z 2025-05-07T20:26:41.1198919Z 2025-05-07T20:26:41.1198923Z 2025-05-07T20:26:41.1198926Z 2025-05-07T20:26:41.1198930Z 2025-05-07T20:26:41.1198934Z 2025-05-07T20:26:41.1198937Z 2025-05-07T20:26:41.1198941Z 2025-05-07T20:26:41.1198944Z 2025-05-07T20:26:41.1198948Z 2025-05-07T20:26:41.1198952Z 2025-05-07T20:26:41.1198963Z 2025-05-07T20:26:41.1198967Z 2025-05-07T20:26:41.1198970Z 2025-05-07T20:26:41.1199157Z  2025-05-07T20:26:41.1199432Z 2025-05-07T20:26:41.1199455Z 2025-05-07T20:26:41.1199461Z 2025-05-07T20:26:41.1199467Z 2025-05-07T20:26:41.1199472Z 2025-05-07T20:26:41.1199477Z 2025-05-07T20:26:41.1199483Z 2025-05-07T20:26:41.1199488Z 2025-05-07T20:26:41.1199493Z 2025-05-07T20:26:41.1199498Z 2025-05-07T20:26:41.1199511Z 2025-05-07T20:26:41.1199517Z 2025-05-07T20:26:41.1199522Z 2025-05-07T20:26:41.1199527Z 2025-05-07T20:26:41.1199532Z 2025-05-07T20:26:41.1199537Z 2025-05-07T20:26:41.1199550Z 2025-05-07T20:26:41.1199802Z  2025-05-07T20:26:41.1200090Z 2025-05-07T20:26:41.1200100Z 2025-05-07T20:26:41.1200104Z 2025-05-07T20:26:41.1200108Z 2025-05-07T20:26:41.1200112Z 2025-05-07T20:26:41.1200115Z 2025-05-07T20:26:41.1200119Z 2025-05-07T20:26:41.1200123Z 2025-05-07T20:26:41.1200135Z 2025-05-07T20:26:41.1200138Z 2025-05-07T20:26:41.1200142Z 2025-05-07T20:26:41.1200145Z 2025-05-07T20:26:41.1200149Z 2025-05-07T20:26:41.1200153Z 2025-05-07T20:26:41.1200156Z 2025-05-07T20:26:41.1200160Z 2025-05-07T20:26:41.1200164Z 2025-05-07T20:26:41.1200170Z 2025-05-07T20:26:41.1201480Z  2025-05-07T20:26:41.1201804Z 
2025-05-07T20:26:41.1202429Z 2025-05-07T20:26:41.1202637Z  2025-05-07T20:26:41.1202799Z 2025-05-07T20:26:41.1202819Z 2025-05-07T20:26:41.1202952Z  2025-05-07T20:26:41.1203064Z 2025-05-07T20:26:41.1203068Z 2025-05-07T20:26:41.1203072Z 2025-05-07T20:26:41.1203186Z  2025-05-07T20:26:41.1203335Z 2025-05-07T20:26:41.1203349Z 2025-05-07T20:26:41.1203354Z 2025-05-07T20:26:41.1203360Z 2025-05-07T20:26:41.1203494Z  2025-05-07T20:26:41.1203611Z 2025-05-07T20:26:41.1203615Z 2025-05-07T20:26:41.1203618Z 2025-05-07T20:26:41.1203622Z 2025-05-07T20:26:41.1203626Z 2025-05-07T20:26:41.1203740Z  2025-05-07T20:26:41.1203867Z 2025-05-07T20:26:41.1203871Z 2025-05-07T20:26:41.1203875Z 2025-05-07T20:26:41.1203878Z 2025-05-07T20:26:41.1203882Z 2025-05-07T20:26:41.1203886Z 2025-05-07T20:26:41.1204002Z  2025-05-07T20:26:41.1204130Z 2025-05-07T20:26:41.1204136Z 2025-05-07T20:26:41.1204141Z 2025-05-07T20:26:41.1204146Z 2025-05-07T20:26:41.1204152Z 2025-05-07T20:26:41.1204157Z 2025-05-07T20:26:41.1204163Z 2025-05-07T20:26:41.1204345Z  2025-05-07T20:26:41.1204483Z 2025-05-07T20:26:41.1204487Z 2025-05-07T20:26:41.1204491Z 2025-05-07T20:26:41.1204495Z 2025-05-07T20:26:41.1204498Z 2025-05-07T20:26:41.1204507Z 2025-05-07T20:26:41.1204511Z 2025-05-07T20:26:41.1204515Z 2025-05-07T20:26:41.1204668Z  2025-05-07T20:26:41.1204879Z 2025-05-07T20:26:41.1204884Z 2025-05-07T20:26:41.1204890Z 2025-05-07T20:26:41.1204895Z 2025-05-07T20:26:41.1204900Z 2025-05-07T20:26:41.1204905Z 2025-05-07T20:26:41.1204911Z 2025-05-07T20:26:41.1205082Z 2025-05-07T20:26:41.1205088Z 2025-05-07T20:26:41.1205269Z  2025-05-07T20:26:41.1205423Z 2025-05-07T20:26:41.1205427Z 2025-05-07T20:26:41.1205431Z 2025-05-07T20:26:41.1205434Z 2025-05-07T20:26:41.1205438Z 2025-05-07T20:26:41.1205442Z 2025-05-07T20:26:41.1205451Z 2025-05-07T20:26:41.1205455Z 2025-05-07T20:26:41.1205459Z 2025-05-07T20:26:41.1205473Z 2025-05-07T20:26:41.1205601Z  2025-05-07T20:26:41.1205756Z 2025-05-07T20:26:41.1205766Z 2025-05-07T20:26:41.1205770Z 2025-05-07T20:26:41.1205773Z 2025-05-07T20:26:41.1205777Z 2025-05-07T20:26:41.1205781Z 2025-05-07T20:26:41.1205870Z 2025-05-07T20:26:41.1205874Z 2025-05-07T20:26:41.1205878Z 2025-05-07T20:26:41.1205881Z 2025-05-07T20:26:41.1205885Z 2025-05-07T20:26:41.1206015Z  2025-05-07T20:26:41.1206247Z 2025-05-07T20:26:41.1206252Z 2025-05-07T20:26:41.1206258Z 2025-05-07T20:26:41.1206263Z 2025-05-07T20:26:41.1206268Z 2025-05-07T20:26:41.1206282Z 2025-05-07T20:26:41.1206288Z 2025-05-07T20:26:41.1206293Z 2025-05-07T20:26:41.1206298Z 2025-05-07T20:26:41.1206303Z 2025-05-07T20:26:41.1206309Z 2025-05-07T20:26:41.1206313Z 2025-05-07T20:26:41.1206462Z  2025-05-07T20:26:41.1206697Z 2025-05-07T20:26:41.1206702Z 2025-05-07T20:26:41.1206707Z 2025-05-07T20:26:41.1206713Z 2025-05-07T20:26:41.1206718Z 2025-05-07T20:26:41.1206723Z 2025-05-07T20:26:41.1206728Z 2025-05-07T20:26:41.1206733Z 2025-05-07T20:26:41.1206739Z 2025-05-07T20:26:41.1206744Z 2025-05-07T20:26:41.1206749Z 2025-05-07T20:26:41.1206754Z 2025-05-07T20:26:41.1206759Z 2025-05-07T20:26:41.1206977Z  2025-05-07T20:26:41.1207230Z 2025-05-07T20:26:41.1207235Z 2025-05-07T20:26:41.1207240Z 2025-05-07T20:26:41.1207246Z 2025-05-07T20:26:41.1207250Z 2025-05-07T20:26:41.1207256Z 2025-05-07T20:26:41.1207261Z 2025-05-07T20:26:41.1207266Z 2025-05-07T20:26:41.1207280Z 2025-05-07T20:26:41.1207286Z 2025-05-07T20:26:41.1207296Z 2025-05-07T20:26:41.1207302Z 2025-05-07T20:26:41.1207307Z 2025-05-07T20:26:41.1207312Z 2025-05-07T20:26:41.1207516Z  2025-05-07T20:26:41.1207847Z 2025-05-07T20:26:41.1207853Z 2025-05-07T20:26:41.1207857Z 
2025-05-07T20:26:41.1207862Z 2025-05-07T20:26:41.1207868Z 2025-05-07T20:26:41.1207873Z 2025-05-07T20:26:41.1207878Z 2025-05-07T20:26:41.1207883Z 2025-05-07T20:26:41.1207888Z 2025-05-07T20:26:41.1207894Z 2025-05-07T20:26:41.1207899Z 2025-05-07T20:26:41.1207904Z 2025-05-07T20:26:41.1207909Z 2025-05-07T20:26:41.1207915Z 2025-05-07T20:26:41.1207921Z 2025-05-07T20:26:41.1208148Z  2025-05-07T20:26:41.1208413Z 2025-05-07T20:26:41.1208418Z 2025-05-07T20:26:41.1208423Z 2025-05-07T20:26:41.1208429Z 2025-05-07T20:26:41.1208434Z 2025-05-07T20:26:41.1208439Z 2025-05-07T20:26:41.1208444Z 2025-05-07T20:26:41.1208450Z 2025-05-07T20:26:41.1208455Z 2025-05-07T20:26:41.1208460Z 2025-05-07T20:26:41.1208473Z 2025-05-07T20:26:41.1208478Z 2025-05-07T20:26:41.1208492Z 2025-05-07T20:26:41.1208497Z 2025-05-07T20:26:41.1208502Z 2025-05-07T20:26:41.1208507Z 2025-05-07T20:26:41.1208722Z  2025-05-07T20:26:41.1208994Z 2025-05-07T20:26:41.1208999Z 2025-05-07T20:26:41.1209005Z 2025-05-07T20:26:41.1209018Z 2025-05-07T20:26:41.1209023Z 2025-05-07T20:26:41.1209028Z 2025-05-07T20:26:41.1209034Z 2025-05-07T20:26:41.1209039Z 2025-05-07T20:26:41.1209044Z 2025-05-07T20:26:41.1209050Z 2025-05-07T20:26:41.1209055Z 2025-05-07T20:26:41.1209060Z 2025-05-07T20:26:41.1209065Z 2025-05-07T20:26:41.1209070Z 2025-05-07T20:26:41.1209081Z 2025-05-07T20:26:41.1209087Z 2025-05-07T20:26:41.1209092Z 2025-05-07T20:26:41.1209316Z  2025-05-07T20:26:41.1209601Z 2025-05-07T20:26:41.1209606Z 2025-05-07T20:26:41.1209611Z 2025-05-07T20:26:41.1209617Z 2025-05-07T20:26:41.1209622Z 2025-05-07T20:26:41.1209627Z 2025-05-07T20:26:41.1209777Z 2025-05-07T20:26:41.1209780Z 2025-05-07T20:26:41.1209784Z 2025-05-07T20:26:41.1209787Z 2025-05-07T20:26:41.1209791Z 2025-05-07T20:26:41.1209794Z 2025-05-07T20:26:41.1209798Z 2025-05-07T20:26:41.1209801Z 2025-05-07T20:26:41.1209805Z 2025-05-07T20:26:41.1209816Z 2025-05-07T20:26:41.1209820Z 2025-05-07T20:26:41.1209823Z 2025-05-07T20:26:41.1210006Z  2025-05-07T20:26:41.1210233Z 2025-05-07T20:26:41.1210239Z 2025-05-07T20:26:41.1210388Z  2025-05-07T20:26:41.1210493Z 2025-05-07T20:26:41.1210497Z 2025-05-07T20:26:41.1210639Z  2025-05-07T20:26:41.1210773Z 2025-05-07T20:26:41.1210910Z 2025-05-07T20:26:41.1210915Z 2025-05-07T20:26:41.1211055Z  2025-05-07T20:26:41.1211199Z 2025-05-07T20:26:41.1211202Z 2025-05-07T20:26:41.1211206Z 2025-05-07T20:26:41.1211210Z 2025-05-07T20:26:41.1211359Z  2025-05-07T20:26:41.1211508Z 2025-05-07T20:26:41.1211512Z 2025-05-07T20:26:41.1211516Z 2025-05-07T20:26:41.1211519Z 2025-05-07T20:26:41.1211529Z 2025-05-07T20:26:41.1211639Z  2025-05-07T20:26:41.1211773Z 2025-05-07T20:26:41.1211777Z 2025-05-07T20:26:41.1211781Z 2025-05-07T20:26:41.1211784Z 2025-05-07T20:26:41.1211788Z 2025-05-07T20:26:41.1211792Z 2025-05-07T20:26:41.1211906Z  2025-05-07T20:26:41.1212043Z 2025-05-07T20:26:41.1212049Z 2025-05-07T20:26:41.1212054Z 2025-05-07T20:26:41.1212059Z 2025-05-07T20:26:41.1212064Z 2025-05-07T20:26:41.1212070Z 2025-05-07T20:26:41.1212075Z 2025-05-07T20:26:41.1212240Z  2025-05-07T20:26:41.1212387Z 2025-05-07T20:26:41.1212390Z 2025-05-07T20:26:41.1212394Z 2025-05-07T20:26:41.1212404Z 2025-05-07T20:26:41.1212408Z 2025-05-07T20:26:41.1212412Z 2025-05-07T20:26:41.1212418Z 2025-05-07T20:26:41.1212423Z 2025-05-07T20:26:41.1212599Z  2025-05-07T20:26:41.1212762Z 2025-05-07T20:26:41.1212766Z 2025-05-07T20:26:41.1212769Z 2025-05-07T20:26:41.1212773Z 2025-05-07T20:26:41.1212777Z 2025-05-07T20:26:41.1212786Z 2025-05-07T20:26:41.1212789Z 2025-05-07T20:26:41.1212793Z 2025-05-07T20:26:41.1212797Z 2025-05-07T20:26:41.1212976Z  
2025-05-07T20:26:41.1213145Z 2025-05-07T20:26:41.1213148Z 2025-05-07T20:26:41.1213154Z 2025-05-07T20:26:41.1213163Z 2025-05-07T20:26:41.1213177Z 2025-05-07T20:26:41.1213182Z 2025-05-07T20:26:41.1213188Z 2025-05-07T20:26:41.1213193Z 2025-05-07T20:26:41.1213198Z 2025-05-07T20:26:41.1213203Z 2025-05-07T20:26:41.1213385Z  2025-05-07T20:26:41.1213558Z 2025-05-07T20:26:41.1213564Z 2025-05-07T20:26:41.1213569Z 2025-05-07T20:26:41.1213574Z 2025-05-07T20:26:41.1213587Z 2025-05-07T20:26:41.1213592Z 2025-05-07T20:26:41.1213597Z 2025-05-07T20:26:41.1213603Z 2025-05-07T20:26:41.1213608Z 2025-05-07T20:26:41.1213613Z 2025-05-07T20:26:41.1213618Z 2025-05-07T20:26:41.1213830Z  2025-05-07T20:26:41.1214068Z 2025-05-07T20:26:41.1214073Z 2025-05-07T20:26:41.1214078Z 2025-05-07T20:26:41.1214090Z 2025-05-07T20:26:41.1214095Z 2025-05-07T20:26:41.1214100Z 2025-05-07T20:26:41.1214105Z 2025-05-07T20:26:41.1214118Z 2025-05-07T20:26:41.1214124Z 2025-05-07T20:26:41.1214129Z 2025-05-07T20:26:41.1214134Z 2025-05-07T20:26:41.1214139Z 2025-05-07T20:26:41.1214335Z  2025-05-07T20:26:41.1214584Z 2025-05-07T20:26:41.1214590Z 2025-05-07T20:26:41.1214595Z 2025-05-07T20:26:41.1214609Z 2025-05-07T20:26:41.1214614Z 2025-05-07T20:26:41.1214619Z 2025-05-07T20:26:41.1214625Z 2025-05-07T20:26:41.1214630Z 2025-05-07T20:26:41.1214635Z 2025-05-07T20:26:41.1214640Z 2025-05-07T20:26:41.1214646Z 2025-05-07T20:26:41.1214670Z 2025-05-07T20:26:41.1214675Z 2025-05-07T20:26:41.1214864Z  2025-05-07T20:26:41.1215125Z 2025-05-07T20:26:41.1215130Z 2025-05-07T20:26:41.1215135Z 2025-05-07T20:26:41.1215141Z 2025-05-07T20:26:41.1215146Z 2025-05-07T20:26:41.1215152Z 2025-05-07T20:26:41.1215157Z 2025-05-07T20:26:41.1215163Z 2025-05-07T20:26:41.1215288Z 2025-05-07T20:26:41.1215293Z 2025-05-07T20:26:41.1215298Z 2025-05-07T20:26:41.1215304Z 2025-05-07T20:26:41.1215309Z 2025-05-07T20:26:41.1215314Z 2025-05-07T20:26:41.1215525Z  2025-05-07T20:26:41.1215783Z 2025-05-07T20:26:41.1215789Z 2025-05-07T20:26:41.1215794Z 2025-05-07T20:26:41.1215799Z 2025-05-07T20:26:41.1215804Z 2025-05-07T20:26:41.1215810Z 2025-05-07T20:26:41.1215815Z 2025-05-07T20:26:41.1215820Z 2025-05-07T20:26:41.1215825Z 2025-05-07T20:26:41.1215830Z 2025-05-07T20:26:41.1215846Z 2025-05-07T20:26:41.1215851Z 2025-05-07T20:26:41.1215856Z 2025-05-07T20:26:41.1215943Z 2025-05-07T20:26:41.1215949Z 2025-05-07T20:26:41.1216165Z  2025-05-07T20:26:41.1216407Z 2025-05-07T20:26:41.1216410Z 2025-05-07T20:26:41.1216414Z 2025-05-07T20:26:41.1216418Z 2025-05-07T20:26:41.1216422Z 2025-05-07T20:26:41.1216425Z 2025-05-07T20:26:41.1216429Z 2025-05-07T20:26:41.1216432Z 2025-05-07T20:26:41.1216443Z 2025-05-07T20:26:41.1216446Z 2025-05-07T20:26:41.1216450Z 2025-05-07T20:26:41.1216454Z 2025-05-07T20:26:41.1216457Z 2025-05-07T20:26:41.1216461Z 2025-05-07T20:26:41.1216465Z 2025-05-07T20:26:41.1216468Z 2025-05-07T20:26:41.1216625Z  2025-05-07T20:26:41.1216838Z 2025-05-07T20:26:41.1216843Z 2025-05-07T20:26:41.1216848Z 2025-05-07T20:26:41.1216853Z 2025-05-07T20:26:41.1216859Z 2025-05-07T20:26:41.1216864Z 2025-05-07T20:26:41.1216869Z 2025-05-07T20:26:41.1216874Z 2025-05-07T20:26:41.1216879Z 2025-05-07T20:26:41.1216885Z 2025-05-07T20:26:41.1216890Z 2025-05-07T20:26:41.1216902Z 2025-05-07T20:26:41.1216907Z 2025-05-07T20:26:41.1216921Z 2025-05-07T20:26:41.1216927Z 2025-05-07T20:26:41.1216932Z 2025-05-07T20:26:41.1216938Z 2025-05-07T20:26:41.1217148Z  2025-05-07T20:26:41.1217415Z 2025-05-07T20:26:41.1217420Z 2025-05-07T20:26:41.1217425Z 2025-05-07T20:26:41.1217431Z 2025-05-07T20:26:41.1217454Z 2025-05-07T20:26:41.1217459Z 
2025-05-07T20:26:41.1217464Z 2025-05-07T20:26:41.1217469Z 2025-05-07T20:26:41.1217475Z 2025-05-07T20:26:41.1217480Z 2025-05-07T20:26:41.1217485Z 2025-05-07T20:26:41.1217490Z 2025-05-07T20:26:41.1217495Z 2025-05-07T20:26:41.1217501Z 2025-05-07T20:26:41.1217506Z 2025-05-07T20:26:41.1217511Z 2025-05-07T20:26:41.1217516Z 2025-05-07T20:26:41.1217522Z 2025-05-07T20:26:41.1217750Z  2025-05-07T20:26:41.1218042Z 2025-05-07T20:26:41.1218047Z 2025-05-07T20:26:41.1218194Z  2025-05-07T20:26:41.1218336Z 2025-05-07T20:26:41.1218341Z 2025-05-07T20:26:41.1218489Z  2025-05-07T20:26:41.1218630Z 2025-05-07T20:26:41.1218644Z 2025-05-07T20:26:41.1218649Z 2025-05-07T20:26:41.1218794Z  2025-05-07T20:26:41.1218937Z 2025-05-07T20:26:41.1218942Z 2025-05-07T20:26:41.1218948Z 2025-05-07T20:26:41.1218953Z 2025-05-07T20:26:41.1219104Z  2025-05-07T20:26:41.1219259Z 2025-05-07T20:26:41.1219270Z 2025-05-07T20:26:41.1219275Z 2025-05-07T20:26:41.1219280Z 2025-05-07T20:26:41.1219284Z 2025-05-07T20:26:41.1219437Z  2025-05-07T20:26:41.1219600Z 2025-05-07T20:26:41.1219605Z 2025-05-07T20:26:41.1219610Z 2025-05-07T20:26:41.1219616Z 2025-05-07T20:26:41.1219621Z 2025-05-07T20:26:41.1219626Z 2025-05-07T20:26:41.1219780Z  2025-05-07T20:26:41.1219948Z 2025-05-07T20:26:41.1219954Z 2025-05-07T20:26:41.1219959Z 2025-05-07T20:26:41.1219964Z 2025-05-07T20:26:41.1219969Z 2025-05-07T20:26:41.1219974Z 2025-05-07T20:26:41.1219980Z 2025-05-07T20:26:41.1220142Z  2025-05-07T20:26:41.1220323Z 2025-05-07T20:26:41.1220327Z 2025-05-07T20:26:41.1220330Z 2025-05-07T20:26:41.1220334Z 2025-05-07T20:26:41.1220338Z 2025-05-07T20:26:41.1220341Z 2025-05-07T20:26:41.1220345Z 2025-05-07T20:26:41.1220349Z 2025-05-07T20:26:41.1220506Z  2025-05-07T20:26:41.1220699Z 2025-05-07T20:26:41.1220703Z 2025-05-07T20:26:41.1220707Z 2025-05-07T20:26:41.1220814Z 2025-05-07T20:26:41.1220818Z 2025-05-07T20:26:41.1220822Z 2025-05-07T20:26:41.1220825Z 2025-05-07T20:26:41.1220829Z 2025-05-07T20:26:41.1220833Z 2025-05-07T20:26:41.1220990Z  2025-05-07T20:26:41.1221202Z 2025-05-07T20:26:41.1221208Z 2025-05-07T20:26:41.1221213Z 2025-05-07T20:26:41.1221218Z 2025-05-07T20:26:41.1221223Z 2025-05-07T20:26:41.1221229Z 2025-05-07T20:26:41.1221234Z 2025-05-07T20:26:41.1221240Z 2025-05-07T20:26:41.1221245Z 2025-05-07T20:26:41.1221259Z 2025-05-07T20:26:41.1221391Z  2025-05-07T20:26:41.1221552Z 2025-05-07T20:26:41.1221556Z 2025-05-07T20:26:41.1221643Z 2025-05-07T20:26:41.1221648Z 2025-05-07T20:26:41.1221651Z 2025-05-07T20:26:41.1221655Z 2025-05-07T20:26:41.1221665Z 2025-05-07T20:26:41.1221669Z 2025-05-07T20:26:41.1221672Z 2025-05-07T20:26:41.1221676Z 2025-05-07T20:26:41.1221680Z 2025-05-07T20:26:41.1221811Z  2025-05-07T20:26:41.1221981Z 2025-05-07T20:26:41.1221990Z 2025-05-07T20:26:41.1221994Z 2025-05-07T20:26:41.1222006Z 2025-05-07T20:26:41.1222010Z 2025-05-07T20:26:41.1222013Z 2025-05-07T20:26:41.1222017Z 2025-05-07T20:26:41.1222021Z 2025-05-07T20:26:41.1222024Z 2025-05-07T20:26:41.1222028Z 2025-05-07T20:26:41.1222032Z 2025-05-07T20:26:41.1222035Z 2025-05-07T20:26:41.1222165Z  2025-05-07T20:26:41.1222343Z 2025-05-07T20:26:41.1222347Z 2025-05-07T20:26:41.1222351Z 2025-05-07T20:26:41.1222354Z 2025-05-07T20:26:41.1222358Z 2025-05-07T20:26:41.1222362Z 2025-05-07T20:26:41.1222365Z 2025-05-07T20:26:41.1222369Z 2025-05-07T20:26:41.1222373Z 2025-05-07T20:26:41.1222380Z 2025-05-07T20:26:41.1222383Z 2025-05-07T20:26:41.1222387Z 2025-05-07T20:26:41.1222391Z 2025-05-07T20:26:41.1224047Z  done 2025-05-07T20:26:41.4336279Z Preparing transaction: \ | / done 2025-05-07T20:26:46.2787628Z Verifying 
transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done 2025-05-07T20:26:47.4876430Z Executing transaction: \ | / - \ | / - \ | / - done 2025-05-07T20:26:50.0920926Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ... 2025-05-07T20:26:50.0921333Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:50.0922022Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:50.0922592Z 2025-05-07T20:26:50.0935553Z 2025-05-07T20:26:50.0936569Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:50.0937273Z 2025-05-07T20:26:50.0950084Z 2025-05-07T20:26:50.0950252Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:50.0955762Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:50.0959685Z 2025-05-07T20:26:50.2528709Z 2025-05-07T20:26:50.2534350Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:50.2538353Z 2025-05-07T20:26:50.2558447Z 2025-05-07T20:26:50.2558719Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:50.2925607Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:52.1700074Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. 
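[NOTE] Commentary added here, not part of the original log: `printenv LD_LIBRARY_PATH` exits non-zero when the variable is unset, so the ERROR above is the expected result of probing a fresh environment rather than a real failure; the script then persists the value with `conda env config vars set`, which re-exports it on every activation of build_binary. A minimal sketch of the same probe-then-persist pattern, reusing the env name and stubs path from this log:

  # probe the variable; if unset, persist it on the conda env
  conda run -n build_binary printenv LD_LIBRARY_PATH \
    || conda env config vars set -n build_binary \
         LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs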
2025-05-07T20:26:52.2324758Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:52.6729597Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:52.7084184Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:53.1426801Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:53.1428132Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:55.5824173Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:57.6025821Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:59.6412783Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:59.6413878Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:27:01.6804840Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:27:03.5731284Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:27:03.6353850Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:07.4942675Z /tmp/tmpy1v2dtjh: line 3: clang: command not found
2025-05-07T20:27:07.4943638Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:07.5585891Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:07.5606095Z total 36
2025-05-07T20:27:07.5606378Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:27:07.5606772Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:27:07.5607215Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:27:07.5608715Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:27:07.5609183Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:27:07.5609994Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:07.5610422Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:07.5610870Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:27:07.5611368Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
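[NOTE] Commentary, not original log output: `~cuda-nvcc_activate.sh` in the listing above is conda's cuda-nvcc activation hook, which (in the typical packaging, an assumption here) injects a `-ccbin=${CXX}` flag so that nvcc always uses conda's gcc as the host compiler. Since this job builds with clang, the step below deletes every line containing `-ccbin=`; a quick way to inspect what the sed will remove first:

  # hypothetical pre-check; the path matches the ls output above
  grep -n -- '-ccbin=' /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
  # then apply the same deletion the log performs next
  sed -i '/-ccbin=/d' /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh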
2025-05-07T20:27:07.5612006Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:27:07.5612425Z 2025-05-07T20:27:07.5631469Z 2025-05-07T20:27:07.5632160Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:27:07.5632420Z 2025-05-07T20:27:09.5173404Z 2025-05-07T20:27:09.5174002Z [BUILD] Setting prepend flags for NVCC ... 2025-05-07T20:27:09.5174529Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:27:09.5174903Z 2025-05-07T20:27:09.9426205Z 2025-05-07T20:27:09.9426566Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:27:09.9426824Z 2025-05-07T20:27:11.8330008Z -allow-unsupported-compiler 2025-05-07T20:27:11.8330329Z 2025-05-07T20:27:11.8955613Z 2025-05-07T20:27:11.8955979Z [INFO] Printing out all preprocessor defines in nvcc ... 2025-05-07T20:27:11.8956776Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:27:11.8957212Z 2025-05-07T20:27:13.8437533Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:27:13.8438239Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:27:13.8438572Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:27:13.8438889Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:27:13.8439214Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:27:13.8439554Z #define _STL_PAIR_H 1 2025-05-07T20:27:13.8439884Z #define __cpp_attributes 200809L 2025-05-07T20:27:13.8441415Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:27:13.8441886Z #define __DELETE_THROW throw() 2025-05-07T20:27:13.8442226Z #define _PTRDIFF_T_ 2025-05-07T20:27:13.8442581Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:27:13.8442983Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:27:13.8443358Z #define _IO_LEFT 02 2025-05-07T20:27:13.8443666Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:27:13.8444027Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:27:13.8444414Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:27:13.8445050Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:27:13.8445625Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:27:13.8445994Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:27:13.8446231Z #define _IOS_OUTPUT 2 2025-05-07T20:27:13.8446456Z #define __SM_100_RT_HPP__ 2025-05-07T20:27:13.8446755Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:27:13.8447190Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:27:13.8447627Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:27:13.8447930Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:27:13.8448269Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:27:13.8458601Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:27:13.8459652Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:27:13.8460117Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:27:13.8460524Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:27:13.8460965Z #define _T_WCHAR_ 2025-05-07T20:27:13.8461272Z #define stdout stdout 2025-05-07T20:27:13.8461714Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:27:13.8462221Z #define CHAR_BIT __CHAR_BIT__ 
2025-05-07T20:27:13.8462563Z #define __flexarr [] 2025-05-07T20:27:13.8462896Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:13.8464299Z nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:27:13.8465541Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:13.8465999Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:13.8466292Z #define _MATH_H 1 2025-05-07T20:27:13.8466567Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:13.8466896Z #define __S64_TYPE long int 2025-05-07T20:27:13.8467133Z #define __stub_fchflags 2025-05-07T20:27:13.8467700Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:13.8468062Z #define __SQUAD_TYPE long int 2025-05-07T20:27:13.8468313Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:13.8468607Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:27:13.8468934Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:13.8469185Z #define NL_NMAX INT_MAX 2025-05-07T20:27:13.8469413Z #define _BITS_TIME_H 1 2025-05-07T20:27:13.8469690Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:13.8470003Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:13.8470306Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:13.8470650Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:13.8471031Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:13.8471394Z #define __CHAR_BIT__ 8 2025-05-07T20:27:13.8471649Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8472029Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:13.8472332Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:13.8472594Z #define FP_NAN 0 2025-05-07T20:27:13.8472851Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:13.8473253Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:13.8473631Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:13.8473907Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:13.8474162Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:13.8474408Z #define __SM_80_RT_H__ 2025-05-07T20:27:13.8474628Z #define _NEW 2025-05-07T20:27:13.8474842Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:13.8475111Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:13.8475468Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:13.8475872Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:13.8476096Z #define __USE_ANSI 1 2025-05-07T20:27:13.8476373Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:13.8476758Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:13.8477107Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:13.8477401Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:13.8477671Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:13.8477939Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:13.8478210Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:13.8478488Z #define PIPE_BUF 4096 2025-05-07T20:27:13.8478802Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:13.8479247Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:27:13.8479613Z #define ADJ_TICK 0x4000 2025-05-07T20:27:13.8479885Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:13.8480201Z #define MQ_PRIO_MAX 32768
2025-05-07T20:27:13.8480446Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:13.8480755Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:13.8481200Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:13.8481701Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:13.8482058Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:13.8482306Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:13.8482565Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8482843Z #define __cpp_static_assert 201411L 2025-05-07T20:27:13.8483114Z #define __GLIBCXX__ 20230528 2025-05-07T20:27:13.8483477Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:13.8483741Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:13.8484014Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:13.8484307Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:13.8484568Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:13.8484858Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8485203Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:13.8485525Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:13.8485795Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:13.8486177Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8486518Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:13.8486867Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:13.8487147Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:13.8487424Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:13.8487741Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:13.8488060Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:13.8488495Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:13.8488887Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:13.8489183Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:13.8489442Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:13.8489703Z #define __GCC_IEC_559 2 2025-05-07T20:27:13.8489984Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:13.8490310Z #define _IO_flockfile(_fp) 2025-05-07T20:27:13.8490553Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:13.8490820Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:13.8491074Z #define _IOFBF 0 2025-05-07T20:27:13.8491270Z #define __USE_BSD 1 2025-05-07T20:27:13.8491488Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:13.8491746Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:13.8491999Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:13.8492240Z #define _IO_NO_WRITES 8 2025-05-07T20:27:13.8492494Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:13.8492836Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:13.8493177Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:13.8493473Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:13.8493780Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:13.8494057Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:13.8494314Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:13.8494572Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:13.8494869Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:13.8495247Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:13.8495600Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:13.8495890Z #define M_PI 
3.14159265358979323846 2025-05-07T20:27:13.8496193Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:13.8496507Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:13.8496811Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:13.8497098Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:13.8497364Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:13.8497622Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:13.8498204Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:13.8498773Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:13.8499090Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:13.8499391Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:13.8499687Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:13.8499960Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:13.8500298Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:13.8500734Z #define __ASSERT_VOID_CAST static_cast<void> 2025-05-07T20:27:13.8501192Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:13.8501506Z #define RAND_MAX 2147483647 2025-05-07T20:27:13.8501861Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:13.8502176Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8502476Z #define __SM_90_RT_H__ 2025-05-07T20:27:13.8502704Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:13.8502953Z #define __COMPAR_FN_T 2025-05-07T20:27:13.8503184Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8503431Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:13.8503892Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:13.8504393Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.8504825Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.8505170Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:13.8505461Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:13.8505780Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:13.8506074Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:13.8506581Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.8507122Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:13.8507433Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:13.8507815Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:13.8508103Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:13.8508418Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:13.8508702Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:13.8508961Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:13.8509213Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:13.8509442Z #define __u_char_defined 2025-05-07T20:27:13.8509754Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:13.8510104Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:13.8510344Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:13.8510590Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:13.8510857Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:13.8511274Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:13.8511690Z #define FP_INFINITE 1 2025-05-07T20:27:13.8512046Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.8512444Z #define _IO_pid_t __pid_t
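The be64toh(x) __bswap_64 (x) entry above reflects a little-endian host: converting a big-endian value to host order requires reversing all eight bytes. A small sketch of an equivalent swap in portable shift form (the hand-rolled bswap64 is illustrative; glibc's __bswap_64 is the actual implementation behind the macro):

#include <cstdint>
#include <cstdio>
// Hand-rolled 64-bit byte swap: on a little-endian host, be64toh must
// mirror the byte order end to end, which is what __bswap_64 does.
static std::uint64_t bswap64(std::uint64_t x) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        r = (r << 8) | ((x >> (8 * i)) & 0xff);  // move byte i to the mirror slot
    return r;
}
int main() {
    std::printf("%016llx\n",
                static_cast<unsigned long long>(bswap64(0x0102030405060708ULL)));
    // prints 0807060504030201
    return 0;
}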
2025-05-07T20:27:13.8512693Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:27:13.8512950Z #define __LEAF , __leaf__ 2025-05-07T20:27:13.8513191Z #define PATH_MAX 4096 2025-05-07T20:27:13.8513428Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:27:13.8513754Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:27:13.8514062Z #define _LIMITS_H___ 2025-05-07T20:27:13.8514275Z #define __size_t 2025-05-07T20:27:13.8514499Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:27:13.8515028Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:27:13.8515569Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:27:13.8515888Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:27:13.8516210Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:27:13.8516464Z #define _WCHAR_T_DEFINED 2025-05-07T20:27:13.8516812Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:27:13.8517211Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:27:13.8517496Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:27:13.8517810Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:27:13.8518106Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:27:13.8518408Z #define __INT8_C(c) c 2025-05-07T20:27:13.8518659Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:27:13.8518953Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:27:13.8519201Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:27:13.8519464Z #define __SM_70_RT_HPP__ 2025-05-07T20:27:13.8519711Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:27:13.8519968Z #define __cpp_variadic_using 201611L 2025-05-07T20:27:13.8520283Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8520707Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:27:13.8520972Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:27:13.8521237Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:27:13.8521489Z #define __cpp_capture_star_this 201603L 2025-05-07T20:27:13.8521795Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:27:13.8522087Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:27:13.8522440Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:27:13.8522811Z #define NFDBITS __NFDBITS 2025-05-07T20:27:13.8523055Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:27:13.8523337Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:27:13.8523764Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:27:13.8524072Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:27:13.8524326Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:27:13.8524603Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:27:13.8524894Z #define STA_UNSYNC 0x0040 2025-05-07T20:27:13.8525197Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:13.8525609Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:27:13.8525959Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:27:13.8526243Z #define __cpp_if_constexpr 201606L 2025-05-07T20:27:13.8526618Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:27:13.8526935Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:27:13.8527241Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:27:13.8527575Z #define __daddr_t_defined 2025-05-07T20:27:13.8527820Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.8528084Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:27:13.8528407Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:27:13.8528923Z #define 
_PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:27:13.8529440Z #define _ACRTIMP 2025-05-07T20:27:13.8529657Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:27:13.8529919Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:27:13.8530207Z #define _IOS_BIN 128 2025-05-07T20:27:13.8530542Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:27:13.8530941Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8531201Z #define UNDERFLOW 4 2025-05-07T20:27:13.8531414Z #define NAME_MAX 255 2025-05-07T20:27:13.8531648Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:27:13.8531919Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:27:13.8532184Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:27:13.8532470Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:27:13.8532847Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:27:13.8533225Z #define __ptr_t void * 2025-05-07T20:27:13.8533457Z #define M_E 2.7182818284590452354 2025-05-07T20:27:13.8533728Z #define cudaSurfaceType1D 0x01 2025-05-07T20:27:13.8533988Z #define __USE_ISOCXX11 1 2025-05-07T20:27:13.8534242Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:27:13.8534551Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:27:13.8534838Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:27:13.8535097Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:27:13.8535376Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:27:13.8535680Z #define cudaSurfaceType2D 0x02 2025-05-07T20:27:13.8535921Z #define __linux 1 2025-05-07T20:27:13.8536139Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:27:13.8536407Z #define cudaDeviceMask 0xff 2025-05-07T20:27:13.8536658Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:27:13.8536940Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:27:13.8537213Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:27:13.8537488Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:27:13.8537783Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:27:13.8538073Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:27:13.8538352Z #define _BITS_TYPES_H 1 2025-05-07T20:27:13.8538624Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:27:13.8539069Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:27:13.8539357Z #define cudaSurfaceType3D 0x03 2025-05-07T20:27:13.8539617Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:27:13.8539896Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:27:13.8540547Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:27:13.8541494Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:27:13.8542294Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:27:13.8542573Z #define __unix 1 2025-05-07T20:27:13.8543024Z #define MATH_ERRNO 1 2025-05-07T20:27:13.8543256Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:27:13.8543525Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:27:13.8543781Z #define __SM_100_RT_H__ 2025-05-07T20:27:13.8544022Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:27:13.8544303Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:27:13.8544589Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8544865Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:27:13.8545163Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:27:13.8545633Z #define __CUDART_API_VERSION 
((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:27:13.8546091Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:27:13.8546397Z #define CUDARTAPI_CDECL 2025-05-07T20:27:13.8546652Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:27:13.8546926Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:27:13.8547204Z #define __cpp_lib_void_t 201411 2025-05-07T20:27:13.8547465Z #define _POSIX_AIO_MAX 1 2025-05-07T20:27:13.8547791Z #define __SIZE_T 2025-05-07T20:27:13.8548032Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:27:13.8548346Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:27:13.8548639Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:27:13.8548890Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:27:13.8549153Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:27:13.8549415Z #define _ATFILE_SOURCE 1 2025-05-07T20:27:13.8549797Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:27:13.8550236Z #define __WAIT_STATUS void * 2025-05-07T20:27:13.8550496Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:27:13.8550752Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:27:13.8551018Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:27:13.8551293Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:27:13.8551555Z #define __WINT_MIN__ 0U 2025-05-07T20:27:13.8552120Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:27:13.8552751Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:27:13.8553041Z #define WUNTRACED 2 2025-05-07T20:27:13.8553260Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:27:13.8553529Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:27:13.8553806Z #define NZERO 20 2025-05-07T20:27:13.8554027Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:27:13.8554301Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:27:13.8554589Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:27:13.8554862Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:27:13.8555111Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.8555389Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:27:13.8555661Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:27:13.8555926Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:27:13.8556189Z #define EXIT_FAILURE 1 2025-05-07T20:27:13.8556423Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:27:13.8556678Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:27:13.8556939Z #define _SIZE_T_DEFINED_ 2025-05-07T20:27:13.8557182Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:27:13.8557450Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:27:13.8557779Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:27:13.8558137Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:27:13.8558416Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:27:13.8558893Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:27:13.8559158Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:27:13.8559442Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:27:13.8559841Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:27:13.8560151Z #define SEEK_DATA 3 2025-05-07T20:27:13.8560379Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:27:13.8560664Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:27:13.8561081Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:27:13.8561465Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:27:13.8561846Z #define __INT64_C(c) c ## L 
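The __CUDART_API_VERSION definition that opens this block packs the runtime version as (major * 1000) + (minor * 10). Assuming the minor digit matches the __CUDACC_VER_MINOR__ 8 reported later in this dump, CUDA 12.8 packs to 12080; a quick sketch of the arithmetic with the values hard-coded for illustration:

#include <cstdio>
// Same packing as the dumped __CUDART_API_VERSION macro:
// (major * 1000) + (minor * 10). The 12 comes from this log's
// __CUDA_API_VER_MAJOR__; the 8 is assumed from __CUDACC_VER_MINOR__.
int main() {
    const int major = 12, minor = 8;
    std::printf("%d\n", (major * 1000) + (minor * 10));  // prints 12080
    return 0;
}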
2025-05-07T20:27:13.8562114Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:27:13.8562446Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:27:13.8562756Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:27:13.8563031Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:27:13.8563321Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:27:13.8563608Z #define STA_PPSWANDER 0x0400 2025-05-07T20:27:13.8563860Z #define __INT_WCHAR_T_H 2025-05-07T20:27:13.8564093Z #define WSTOPPED 2 2025-05-07T20:27:13.8564315Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:27:13.8564597Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:27:13.8564847Z #define FP_NORMAL 4 2025-05-07T20:27:13.8565079Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:27:13.8565345Z #define _BITS_TIMEX_H 1 2025-05-07T20:27:13.8565573Z #define _POSIX_LINK_MAX 8 2025-05-07T20:27:13.8565819Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:27:13.8566085Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:27:13.8566357Z #define cudaTextureType1D 0x01 2025-05-07T20:27:13.8566618Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:27:13.8566869Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:27:13.8567133Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:27:13.8567425Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:27:13.8567837Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:27:13.8568284Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:27:13.8568540Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:27:13.8568794Z #define _POSIX_SOURCE 1 2025-05-07T20:27:13.8569033Z #define cudaTextureType2D 0x02 2025-05-07T20:27:13.8569290Z #define _PTR_TRAITS_H 1 2025-05-07T20:27:13.8569553Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:27:13.8569853Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:27:13.8570111Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:27:13.8570423Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:27:13.8570744Z #define cudaTextureType3D 0x03 2025-05-07T20:27:13.8571004Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:27:13.8571255Z #define CLOCK_REALTIME 0 2025-05-07T20:27:13.8571490Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:27:13.8571753Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:27:13.8572046Z #define __cpp_aligned_new 201606L 2025-05-07T20:27:13.8572307Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:27:13.8572580Z #define cudaEventBlockingSync 0x01 2025-05-07T20:27:13.8572861Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:27:13.8573118Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:27:13.8573425Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:27:13.8573711Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:27:13.8573984Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:27:13.8574224Z #define __GLIBC__ 2 2025-05-07T20:27:13.8574433Z #define __END_DECLS } 2025-05-07T20:27:13.8574664Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:27:13.8575021Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:27:13.8575393Z #define __CONCAT(x,y) x ## y 2025-05-07T20:27:13.8631097Z #define WCONTINUED 8 2025-05-07T20:27:13.8631525Z #define __STDC_HOSTED__ 1 2025-05-07T20:27:13.8631872Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:27:13.8632236Z #define _ALLOCA_H 1 2025-05-07T20:27:13.8632533Z #define __host__ __location__(host) 2025-05-07T20:27:13.8634051Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:27:13.8634630Z #define 
__SLONG32_TYPE int 2025-05-07T20:27:13.8634976Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:27:13.8635364Z #define _SYS_SELECT_H 1 2025-05-07T20:27:13.8635684Z #define _IO_LINE_BUF 0x200 2025-05-07T20:27:13.8636019Z #define _IOS_NOCREATE 32 2025-05-07T20:27:13.8636347Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:27:13.8636728Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:27:13.8637122Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:27:13.8637501Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:27:13.8638024Z #define __global__ __location__(global) 2025-05-07T20:27:13.8638466Z #define __GNU_LIBRARY__ 6 2025-05-07T20:27:13.8638804Z #define __cpp_decltype_auto 201304L 2025-05-07T20:27:13.8639174Z #define __DBL_DIG__ 15 2025-05-07T20:27:13.8639477Z #define TIME_UTC 1 2025-05-07T20:27:13.8639766Z #define __FLT32_DIG__ 6 2025-05-07T20:27:13.8640493Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:27:13.8641036Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:27:13.8641392Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:27:13.8641697Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:27:13.8641985Z #define _G_BUFSIZ 8192 2025-05-07T20:27:13.8642280Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:27:13.8642633Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:27:13.8642926Z #define __cudaCDP2GetDevice 2025-05-07T20:27:13.8643196Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:27:13.8643472Z #define STA_CLOCKERR 0x1000 2025-05-07T20:27:13.8643713Z #define __GXX_WEAK__ 1 2025-05-07T20:27:13.8643961Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8644255Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:27:13.8644508Z #define __SHRT_WIDTH__ 16 2025-05-07T20:27:13.8644797Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:27:13.8645120Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:27:13.8645397Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:27:13.8645679Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:27:13.8645965Z #define _G_config_h 1 2025-05-07T20:27:13.8646238Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:27:13.8646584Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:27:13.8646953Z #define _GCC_WCHAR_T 2025-05-07T20:27:13.8647170Z #define TMP_MAX 238328 2025-05-07T20:27:13.8647407Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:27:13.8647675Z #define __DEVICE_TYPES_H__ 2025-05-07T20:27:13.8647921Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8648199Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:27:13.8648467Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:27:13.8648738Z #define _IO_SKIPWS 01 2025-05-07T20:27:13.8649154Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:27:13.8649602Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:27:13.8649963Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:27:13.8650446Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:27:13.8650899Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:27:13.8651263Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:27:13.8651615Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.8651861Z #define le32toh(x) (x) 2025-05-07T20:27:13.8652087Z #define _SIZE_T_DEFINED 2025-05-07T20:27:13.8652324Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:27:13.8652654Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:27:13.8653000Z #define 
__DEC32_MAX__ 9.999999E96DF 2025-05-07T20:27:13.8653389Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:27:13.8653791Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:27:13.8654053Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:27:13.8654304Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:27:13.8654563Z #define _POSIX_NAME_MAX 14 2025-05-07T20:27:13.8655095Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:27:13.8655601Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:27:13.8656079Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:27:13.8656379Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:27:13.8656722Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:27:13.8657021Z #define _WCHAR_T_ 2025-05-07T20:27:13.8657239Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:27:13.8657603Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:27:13.8658121Z #define RTSIG_MAX 32 2025-05-07T20:27:13.8658340Z #define _STDDEF_H 2025-05-07T20:27:13.8658565Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:27:13.8658824Z #define _VA_LIST_DEFINED 2025-05-07T20:27:13.8659073Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:27:13.8659396Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:27:13.8659777Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:27:13.8660101Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:27:13.8660385Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:27:13.8660832Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:27:13.8661339Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:27:13.8661691Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:27:13.8661998Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:27:13.8662291Z #define __unix__ 1 2025-05-07T20:27:13.8662515Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8662789Z #define __INT_WIDTH__ 32 2025-05-07T20:27:13.8663016Z #define __SIZEOF_LONG__ 8 2025-05-07T20:27:13.8663244Z #define _IONBF 2 2025-05-07T20:27:13.8663674Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:27:13.8664416Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:27:13.8664943Z #define __STDC_IEC_559__ 1 2025-05-07T20:27:13.8665189Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:27:13.8665445Z #define __UINT16_C(c) c 2025-05-07T20:27:13.8665672Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:27:13.8665929Z #define STA_DEL 0x0020 2025-05-07T20:27:13.8666163Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:27:13.8666403Z #define __id_t_defined 2025-05-07T20:27:13.8666663Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:27:13.8667106Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:27:13.8667509Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:27:13.8667873Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:27:13.8668121Z #define __DECIMAL_DIG__ 21 2025-05-07T20:27:13.8668369Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:27:13.8668614Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:27:13.8668868Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:27:13.8669124Z #define SING 2 2025-05-07T20:27:13.8669325Z #define STA_FREQHOLD 0x0080 2025-05-07T20:27:13.8669580Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8669866Z #define cudaStreamDefault 0x00 2025-05-07T20:27:13.8670200Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:27:13.8670570Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:27:13.8670827Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:27:13.8671078Z #define __gnu_linux__ 1 2025-05-07T20:27:13.8671306Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:27:13.8671556Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:27:13.8671859Z #define MAX_INPUT 255 2025-05-07T20:27:13.8672101Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:27:13.8672416Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:27:13.8672772Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:27:13.8673079Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:27:13.8673339Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:27:13.8673836Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:27:13.8674236Z #define _IO_SHOWPOS 02000 2025-05-07T20:27:13.8674556Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:27:13.8674906Z #define _Mfloat_ float 2025-05-07T20:27:13.8675155Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:27:13.8675456Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.8675739Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:27:13.8676051Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:27:13.8676665Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:27:13.8677163Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8677430Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:27:13.8677738Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:27:13.8678093Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:27:13.8678382Z #define __USE_ISOC11 1 2025-05-07T20:27:13.8678597Z #define _BSD_SIZE_T_ 2025-05-07T20:27:13.8678816Z #define ADJ_MICRO 0x1000 2025-05-07T20:27:13.8679054Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:27:13.8679296Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:27:13.8679581Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:27:13.8679888Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:27:13.8680264Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:27:13.8680642Z #define __THROW throw () 2025-05-07T20:27:13.8680883Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:27:13.8681168Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8681507Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.8681850Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:27:13.8682117Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:27:13.8682364Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:27:13.8682626Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:27:13.8682878Z #define L_tmpnam 20 2025-05-07T20:27:13.8683086Z #define ___int_wchar_t_h 2025-05-07T20:27:13.8683417Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:27:13.8683795Z #define isascii(c) __isascii (c) 2025-05-07T20:27:13.8684039Z #define _T_PTRDIFF 2025-05-07T20:27:13.8684335Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:27:13.8684675Z #define toascii(c) __toascii (c) 2025-05-07T20:27:13.8684927Z #define __GNUC__ 11 2025-05-07T20:27:13.8685167Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:27:13.8685455Z #define __GXX_RTTI 1 2025-05-07T20:27:13.8685673Z #define __pie__ 2 2025-05-07T20:27:13.8685868Z #define __MMX__ 1 2025-05-07T20:27:13.8686081Z #define __cudaCDP2Malloc 2025-05-07T20:27:13.8686328Z #define __timespec_defined 1 2025-05-07T20:27:13.8686566Z #define L_ctermid 9 2025-05-07T20:27:13.8686796Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:13.8687094Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:27:13.8687478Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:27:13.8687837Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:27:13.8688093Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:27:13.8688367Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:27:13.8688660Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:27:13.8688971Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:27:13.8689225Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:27:13.8689643Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:27:13.8690374Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:13.8690956Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:27:13.8691251Z #define __USE_SVID 1 2025-05-07T20:27:13.8691497Z #define __constant__ __location__(constant) 2025-05-07T20:27:13.8691799Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:27:13.8692181Z #define __device__ __location__(device) 2025-05-07T20:27:13.8692501Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:27:13.8692809Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:27:13.8693057Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:27:13.8693346Z #define CUDART_DEVICE __device__ 2025-05-07T20:27:13.8693687Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:27:13.8694043Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:27:13.8694317Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:27:13.8694672Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:27:13.8695159Z #define __STDC_UTF_16__ 1 2025-05-07T20:27:13.8695399Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:27:13.8695754Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:27:13.8696162Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:27:13.8696458Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:27:13.8696731Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:27:13.8696992Z #define NGROUPS_MAX 65536 2025-05-07T20:27:13.8697232Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:27:13.8697483Z #define __USE_ISOC95 1 2025-05-07T20:27:13.8697700Z #define _TIME_H 1 2025-05-07T20:27:13.8697954Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:27:13.8698262Z #define __USE_ISOC99 1 2025-05-07T20:27:13.8698581Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:27:13.8698933Z #define HOST_NAME_MAX 64 2025-05-07T20:27:13.8699171Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:27:13.8699423Z #define _IOS_ATEND 4 2025-05-07T20:27:13.8699640Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:27:13.8699957Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:27:13.8700352Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:13.8700685Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:27:13.8700948Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:27:13.8701260Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:27:13.8701567Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:27:13.8701811Z #define _STDIO_H 1 2025-05-07T20:27:13.8702192Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:27:13.8702647Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:27:13.8702988Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:13.8703354Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:27:13.8703638Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:27:13.8703896Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:27:13.8704163Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:27:13.8704447Z #define __cpp_raw_strings 200710L 2025-05-07T20:27:13.8704740Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8705040Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:27:13.8705305Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:27:13.8705574Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:27:13.8705965Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:27:13.8706255Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:27:13.8706530Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:27:13.8706873Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:27:13.8707234Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:27:13.8707472Z #define __USE_XOPEN 1 2025-05-07T20:27:13.8707788Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:27:13.8708217Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:13.8708641Z #define __USE_XOPEN2K 1 2025-05-07T20:27:13.8708881Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:27:13.8709132Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:27:13.8709458Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:27:13.8709837Z #define __cpp_fold_expressions 201603L 2025-05-07T20:27:13.8710576Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:27:13.8711207Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:27:13.8711482Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:27:13.8711824Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:27:13.8712197Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8712572Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:27:13.8712958Z #define __END_NAMESPACE_C99 2025-05-07T20:27:13.8713227Z #define __glibcxx_integral_traps true 2025-05-07T20:27:13.8713507Z #define _POSIX_PATH_MAX 256 2025-05-07T20:27:13.8713761Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:27:13.8714095Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:27:13.8714355Z #define _IOS_TRUNC 16 2025-05-07T20:27:13.8714581Z #define _ISOC11_SOURCE 1 2025-05-07T20:27:13.8714817Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:27:13.8715095Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:27:13.8715386Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:27:13.8715733Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:27:13.8716121Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:27:13.8716393Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:27:13.8716641Z #define _IO_UNITBUF 020000 2025-05-07T20:27:13.8716896Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:27:13.8717150Z #define __FD_SETSIZE 1024 2025-05-07T20:27:13.8717389Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:27:13.8717655Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:27:13.8717988Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:27:13.8718335Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:27:13.8718592Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:27:13.8718893Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:27:13.8719206Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:27:13.8719462Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:27:13.8719755Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:27:13.8720080Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:27:13.8720358Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:27:13.8720674Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:27:13.8720964Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:27:13.8721229Z #define __USE_POSIX199506 1 2025-05-07T20:27:13.8721475Z #define _FEATURES_H 1 2025-05-07T20:27:13.8721705Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:27:13.8722095Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:27:13.8722549Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:27:13.8722870Z #define 
__stub_getmsg 2025-05-07T20:27:13.8723103Z #define _IO_FIXED 010000 2025-05-07T20:27:13.8723362Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:27:13.8723668Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:27:13.8723934Z #define __stub_setlogin 2025-05-07T20:27:13.8724163Z #define __stub_fattach 2025-05-07T20:27:13.8724400Z #define __cplusplus 201703L 2025-05-07T20:27:13.8724664Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:27:13.8724933Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:27:13.8725187Z #define INFINITY (__builtin_inff()) 2025-05-07T20:27:13.8725465Z #define _IO_UNBUFFERED 2 2025-05-07T20:27:13.8725939Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:27:13.8726453Z #define _IO_INTERNAL 010 2025-05-07T20:27:13.8726698Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:27:13.8727029Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:13.8727369Z #define __dev_t_defined 2025-05-07T20:27:13.8727601Z #define __DEPRECATED 1 2025-05-07T20:27:13.8727828Z #define __S32_TYPE int 2025-05-07T20:27:13.8728067Z #define __cpp_rvalue_references 200610L 2025-05-07T20:27:13.8728365Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:27:13.8728619Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:27:13.8728862Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:27:13.8729470Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:27:13.8730197Z #define _G_HAVE_MREMAP 1 2025-05-07T20:27:13.8730495Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:13.8730824Z #define OVERFLOW 3 2025-05-07T20:27:13.8731063Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:27:13.8731366Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:27:13.8731635Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8731963Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:27:13.8732285Z #define __SSE2_MATH__ 1 2025-05-07T20:27:13.8732516Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:27:13.8732902Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8733201Z #define _IO_STDIO_H 2025-05-07T20:27:13.8733435Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:27:13.8733721Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:27:13.8734032Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:27:13.8734319Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8734625Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:27:13.8734893Z #define __amd64 1 2025-05-07T20:27:13.8735109Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:27:13.8735363Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:27:13.8735634Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:27:13.8735916Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:27:13.8736211Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:27:13.8736469Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:27:13.8736759Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:27:13.8737006Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:27:13.8737258Z #define __bounded 2025-05-07T20:27:13.8737482Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:13.8737737Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8738017Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:27:13.8738366Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:27:13.8738703Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:27:13.8738975Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8739287Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:27:13.8739696Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:13.8740339Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:27:13.8740624Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:27:13.8740963Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:27:13.8741297Z #define STA_PLL 0x0001 2025-05-07T20:27:13.8741534Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:27:13.8741792Z #define __GNUG__ 11 2025-05-07T20:27:13.8742011Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:27:13.8742278Z #define _T_WCHAR 2025-05-07T20:27:13.8742514Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:27:13.8742790Z #define __specialization_static 2025-05-07T20:27:13.8743086Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:27:13.8743388Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:27:13.8743637Z #define cudaArraySparse 0x40 2025-05-07T20:27:13.8743893Z #define STA_PPSFREQ 0x0002 2025-05-07T20:27:13.8744172Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:27:13.8744471Z #define _WCHAR_T 2025-05-07T20:27:13.8744678Z #define __cudaCDP2Free 2025-05-07T20:27:13.8745313Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:27:13.8745995Z #define __cpp_nsdmi 200809L 2025-05-07T20:27:13.8746423Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:27:13.8746869Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:27:13.8747138Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:27:13.8747388Z #define cudaArrayCubemap 0x04 2025-05-07T20:27:13.8747806Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:13.8748146Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:27:13.8748386Z #define __NO_CTYPE 1 2025-05-07T20:27:13.8748853Z #define __stub_bdflush 2025-05-07T20:27:13.8749210Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:27:13.8749619Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:27:13.8749903Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:27:13.8750161Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:27:13.8750429Z #define __cpp_initializer_lists 200806L 2025-05-07T20:27:13.8750714Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:27:13.8751001Z #define __U16_TYPE unsigned short int 2025-05-07T20:27:13.8751329Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:27:13.8751791Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:27:13.8752066Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:27:13.8752338Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:27:13.8753091Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:27:13.8762529Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:27:13.8762819Z #define _IO_STDIO 040000 2025-05-07T20:27:13.8763156Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:27:13.8763544Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:27:13.8763849Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:27:13.8764133Z #define _PTRDIFF_T 2025-05-07T20:27:13.8764345Z #define _MOVE_H 1 2025-05-07T20:27:13.8764565Z #define __cpp_hex_float 201603L 2025-05-07T20:27:13.8764821Z #define ADJ_TAI 0x0080 2025-05-07T20:27:13.8765044Z #define __ptrvalue 2025-05-07T20:27:13.8765261Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:27:13.8765513Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:27:13.8765817Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:27:13.8766216Z #define MATH_ERREXCEPT 2 2025-05-07T20:27:13.8766502Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:27:13.8766786Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:27:13.8767169Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:27:13.8767540Z #define __USE_GNU 1 2025-05-07T20:27:13.8767769Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:27:13.8768043Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:27:13.8768349Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:27:13.8768733Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:27:13.8769112Z #define WEXITED 4 2025-05-07T20:27:13.8769317Z #define _IO_NO_READS 4 2025-05-07T20:27:13.8769660Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:27:13.8770041Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:27:13.8770306Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:27:13.8770601Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:27:13.8770908Z #define __uid_t_defined 2025-05-07T20:27:13.8771143Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:27:13.8771431Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:27:13.8771694Z #define WNOHANG 1 2025-05-07T20:27:13.8771932Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:27:13.8772223Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:27:13.8772497Z #define cudaEventDefault 0x00 2025-05-07T20:27:13.8772796Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:27:13.8773101Z #define NL_SETMAX INT_MAX 2025-05-07T20:27:13.8773334Z #define __x86_64 1 2025-05-07T20:27:13.8773562Z #define __cudaCDP2LaunchDevice 2025-05-07T20:27:13.8773941Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.8774421Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:27:13.8774914Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:13.8775331Z #define __PTRDIFF_T 2025-05-07T20:27:13.8775650Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:27:13.8776015Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:27:13.8776285Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8776566Z #define _Mlong_double_ long double 2025-05-07T20:27:13.8776837Z #define __cpp_lambdas 200907L 2025-05-07T20:27:13.8777302Z #define _IO_DEC 020 2025-05-07T20:27:13.8777515Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:27:13.8777774Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:27:13.8778058Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:27:13.8778321Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:27:13.8778579Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:27:13.8778870Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:27:13.8779184Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:27:13.8779449Z #define _ANSI_STDDEF_H 2025-05-07T20:27:13.8779722Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:27:13.8780126Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:27:13.8780486Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:27:13.8780857Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:27:13.8781132Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:27:13.8781414Z #define __cpp_template_auto 201606L 2025-05-07T20:27:13.8781766Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:27:13.8782133Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:27:13.8782390Z #define __key_t_defined 2025-05-07T20:27:13.8782633Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:27:13.8782996Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:27:13.8783450Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:27:13.8783813Z #define __GNUC_VA_LIST 2025-05-07T20:27:13.8784144Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:13.8784521Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:27:13.8784781Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:27:13.8785053Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:27:13.8785329Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:27:13.8785571Z #define __WCOREFLAG 0x80 2025-05-07T20:27:13.8785817Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:27:13.8786105Z #define cudaEventDisableTiming 0x02 2025-05-07T20:27:13.8786377Z #define __LP64__ 1 2025-05-07T20:27:13.8786616Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:27:13.8786914Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:27:13.8787187Z #define _IO_off64_t __off64_t 2025-05-07T20:27:13.8787439Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8787812Z #define __time_t_defined 1 2025-05-07T20:27:13.8788052Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:27:13.8788389Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:27:13.8788750Z #define __USE_UNIX98 1 2025-05-07T20:27:13.8788972Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8789238Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:27:13.8789498Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:27:13.8789778Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:27:13.8790078Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:27:13.8790327Z #define SEEK_CUR 1 2025-05-07T20:27:13.8790542Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8790804Z #define _ASSERT_H 1 2025-05-07T20:27:13.8791353Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:27:13.8791956Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:27:13.8792216Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:27:13.8792463Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:27:13.8792720Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:27:13.8792975Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:27:13.8793332Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:13.8793729Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:27:13.8794360Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:27:13.8794989Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:27:13.8795276Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:27:13.8795718Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:27:13.8796077Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:27:13.8796335Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.8796609Z #define cudaArrayDefault 0x00 2025-05-07T20:27:13.8796872Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:27:13.8797152Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:27:13.8797426Z #define TLOSS 5 2025-05-07T20:27:13.8797630Z #define __ssize_t_defined 2025-05-07T20:27:13.8797873Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:27:13.8798226Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:27:13.8798506Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:27:13.8798774Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:27:13.8799051Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:27:13.8799319Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:27:13.8799623Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:27:13.8799913Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:27:13.8800252Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:27:13.8800618Z #define __REGISTER_PREFIX__ 2025-05-07T20:27:13.8800980Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:27:13.8801419Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:27:13.8801771Z #define _IOS_NOREPLACE 64 2025-05-07T20:27:13.8802005Z #define __cdecl 2025-05-07T20:27:13.8802233Z #define cudaEventInterprocess 0x04 2025-05-07T20:27:13.8802548Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:27:13.8802870Z #define LOGIN_NAME_MAX 256 2025-05-07T20:27:13.8803123Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:27:13.8803380Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:27:13.8803659Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:27:13.8803917Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:27:13.8804221Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:27:13.8804539Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:27:13.8804939Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:13.8805365Z #define ADJ_NANO 0x2000 2025-05-07T20:27:13.8805655Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:27:13.8806002Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:27:13.8806279Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:27:13.8806528Z #define __FLT_DIG__ 6 2025-05-07T20:27:13.8806864Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:27:13.8807257Z #define __NO_INLINE__ 1 2025-05-07T20:27:13.8807550Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:13.8807890Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:27:13.8808148Z #define ADJ_STATUS 0x0010 2025-05-07T20:27:13.8808442Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:27:13.8808715Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:27:13.8808978Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:13.8809272Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:27:13.8809558Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:27:13.8809939Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:27:13.8810345Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:27:13.8810677Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:27:13.8811010Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:27:13.8811245Z #define MAX_CANON 255 2025-05-07T20:27:13.8811464Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:27:13.8811712Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:27:13.8811973Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:27:13.8812258Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:27:13.8812557Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:27:13.8812851Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:27:13.8813118Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:27:13.8813425Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:27:13.8813734Z #define __VERSION__ "11.4.0" 2025-05-07T20:27:13.8814091Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:27:13.8814374Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:27:13.8814657Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:27:13.8814929Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:27:13.8815224Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:27:13.8815513Z #define __UINT64_C(c) c ## UL 2025-05-07T20:27:13.8815765Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:27:13.8816002Z #define _SYS_TYPES_H 1 2025-05-07T20:27:13.8816235Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:27:13.8816492Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:27:13.8816815Z #define _SYS_CDEFS_H 1 2025-05-07T20:27:13.8817041Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:27:13.8817308Z #define __cpp_unicode_characters 201411L 2025-05-07T20:27:13.8817589Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:27:13.8817828Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:27:13.8818112Z #define __cudaCDP2StreamDestroy 2025-05-07T20:27:13.8818382Z #define FP_SUBNORMAL 3 2025-05-07T20:27:13.8818617Z #define cudaOccupancyDefault 0x00 2025-05-07T20:27:13.8818888Z #define _INITIALIZER_LIST 2025-05-07T20:27:13.8819132Z #define _STDC_PREDEF_H 1 2025-05-07T20:27:13.8819393Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:27:13.8819678Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:27:13.8819922Z #define _IO_file_flags _flags 2025-05-07T20:27:13.8820172Z #define __USE_XOPEN2K8 1 2025-05-07T20:27:13.8820419Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:27:13.8820691Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:27:13.8820951Z #define HUGE 3.40282347e+38F 2025-05-07T20:27:13.8821218Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:27:13.8821586Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:27:13.8821961Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:27:13.8822259Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:27:13.8822526Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:27:13.8822769Z #define _BSD_SOURCE 1 2025-05-07T20:27:13.8823002Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:27:13.8823823Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:27:13.8824657Z #define __catch(X) catch(X) 2025-05-07T20:27:13.8824906Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:27:13.8825185Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:27:13.8825452Z #define __TIMER_T_TYPE void * 2025-05-07T20:27:13.8825688Z #define __STRING(x) #x 2025-05-07T20:27:13.8825972Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:27:13.8826292Z #define _T_PTRDIFF_ 2025-05-07T20:27:13.8826525Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:27:13.8826826Z
#define cudaEventWaitExternal 0x01 2025-05-07T20:27:13.8827088Z #define __unbounded 2025-05-07T20:27:13.8827313Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8827709Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:27:13.8827983Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8828269Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:27:13.8828546Z #define __cpp_lib_is_final 201402L 2025-05-07T20:27:13.8828834Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:27:13.8829153Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:27:13.8829446Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:27:13.8829802Z #define __managed__ __location__(managed) 2025-05-07T20:27:13.8830093Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:27:13.8830483Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:13.8830897Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:27:13.8831151Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:27:13.8831511Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:27:13.8831914Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:27:13.8832155Z #define _SYS_SIZE_T_H 2025-05-07T20:27:13.8832543Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:27:13.8832869Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:27:13.8833150Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:27:13.8833437Z #define _CRTIMP 2025-05-07T20:27:13.8833652Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:27:13.8833949Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:13.8834268Z #define STA_PPSJITTER 0x0200 2025-05-07T20:27:13.8834607Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:27:13.8835016Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8835448Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:27:13.8835714Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:27:13.8835991Z #define __SIZE_T__ 2025-05-07T20:27:13.8836201Z #define __stub_gtty 2025-05-07T20:27:13.8836414Z #define __pid_t_defined 2025-05-07T20:27:13.8836682Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:27:13.8836987Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8837301Z #define __glibcxx_function_requires(...) 
2025-05-07T20:27:13.8837578Z #define __SM_80_RT_HPP__ 2025-05-07T20:27:13.8837818Z #define __need_clockid_t 2025-05-07T20:27:13.8838063Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:27:13.8838307Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:27:13.8838619Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:27:13.8838927Z #define _IO_HEX 0100 2025-05-07T20:27:13.8839170Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:27:13.8839502Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:27:13.8839602Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:27:13.8839703Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:27:13.8839919Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.8840031Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:27:13.8840517Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:27:13.8840665Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:27:13.8840784Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:27:13.8840884Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:27:13.8840963Z #define __stub_sstk 2025-05-07T20:27:13.8841058Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:27:13.8841206Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:27:13.8841283Z #define __wur 2025-05-07T20:27:13.8841399Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:27:13.8841482Z #define _G_HAVE_MMAP 1 2025-05-07T20:27:13.8841560Z #define _IO_OCT 040 2025-05-07T20:27:13.8841654Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:27:13.8841746Z #define NL_MSGMAX INT_MAX 2025-05-07T20:27:13.8841833Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:27:13.8841962Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:27:13.8842049Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:27:13.8842151Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:27:13.8842336Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:27:13.8842432Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:27:13.8842525Z #define _STL_ALGOBASE_H 1 2025-05-07T20:27:13.8842627Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:27:13.8842711Z #define __off64_t_defined 2025-05-07T20:27:13.8842812Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:27:13.8842898Z #define __FLT128_DIG__ 33 2025-05-07T20:27:13.8842997Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:27:13.8843094Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:27:13.8843173Z #define __INT32_C(c) c 2025-05-07T20:27:13.8843269Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:27:13.8843365Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:27:13.8843457Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:27:13.8843548Z #define __PDP_ENDIAN 3412 2025-05-07T20:27:13.8843632Z #define _ISOC95_SOURCE 1 2025-05-07T20:27:13.8843722Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:27:13.8843855Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:27:13.8844190Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:27:13.8844275Z #define __SM_90_RT_HPP__ 2025-05-07T20:27:13.8844375Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:27:13.8844468Z #define __have_pthread_attr_t 1 2025-05-07T20:27:13.8844563Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:27:13.8844784Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:27:13.8844889Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:27:13.8844995Z #define __cudaCDP2EventRecord 2025-05-07T20:27:13.8845084Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:27:13.8845167Z #define 
htole32(x) (x) 2025-05-07T20:27:13.8845538Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:27:13.8845656Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:27:13.8845750Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:27:13.8845907Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:27:13.8846043Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:27:13.8846169Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:27:13.8846310Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:27:13.8846401Z #define ADJ_OFFSET 0x0001 2025-05-07T20:27:13.8846502Z #define cudaArrayLayered 0x01 2025-05-07T20:27:13.8846666Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:27:13.8846773Z #define cudaEventRecordDefault 0x00 2025-05-07T20:27:13.8846872Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:27:13.8846968Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:27:13.8847045Z #define unix 1 2025-05-07T20:27:13.8847151Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:27:13.8847241Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:27:13.8847333Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:27:13.8847455Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:27:13.8847537Z #define __USE_POSIX 1 2025-05-07T20:27:13.8847627Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:27:13.8847764Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:27:13.8847861Z #define __THROWNL throw () 2025-05-07T20:27:13.8847954Z #define __cpp_rtti 199711L 2025-05-07T20:27:13.8848055Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:27:13.8848141Z #define __PMT(args) args 2025-05-07T20:27:13.8848257Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8848401Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:27:13.8848510Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:27:13.8848605Z #define _SIZE_T_DECLARED 2025-05-07T20:27:13.8848699Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:27:13.8848787Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:27:13.8849180Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:27:13.8849277Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:27:13.8849374Z #define XATTR_LIST_MAX 65536 2025-05-07T20:27:13.8849464Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:27:13.8849610Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:27:13.8849700Z #define _WCHAR_T_H 2025-05-07T20:27:13.8849787Z #define __FLT64X_DIG__ 18 2025-05-07T20:27:13.8849875Z #define _IO_SHOWBASE 0200 2025-05-07T20:27:13.8849967Z #define _POSIX_QLIMIT 1 2025-05-07T20:27:13.8850060Z #define __INT8_TYPE__ signed char 2025-05-07T20:27:13.8850154Z #define __SURFACE_TYPES_H__ 2025-05-07T20:27:13.8850247Z #define __CUDA_ARCH__ 520 2025-05-07T20:27:13.8850350Z #define __cpp_digit_separators 201309L 2025-05-07T20:27:13.8850429Z #define __ELF__ 1 2025-05-07T20:27:13.8850531Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:27:13.8850628Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:27:13.8850715Z #define STA_INS 0x0010 2025-05-07T20:27:13.8850809Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:27:13.8850975Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:27:13.8851072Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:27:13.8851163Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8851366Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:27:13.8851477Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8851571Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:13.8851671Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:13.8851770Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:13.8851920Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:13.8852079Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:13.8852175Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:13.8852568Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:13.8852702Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:13.8852792Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:13.8852878Z #define __FLT_RADIX__ 2 2025-05-07T20:27:13.8852983Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:13.8853145Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:13.8853241Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:13.8853338Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:13.8853436Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:13.8853536Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:13.8853628Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:13.8853726Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:13.8853813Z #define WORD_BIT 32 2025-05-07T20:27:13.8853898Z #define _IO_USER_BUF 1 2025-05-07T20:27:13.8853987Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:13.8854091Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8854203Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:13.8854303Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:13.8854403Z #define __long_double_t long double 2025-05-07T20:27:13.8854495Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:13.8854584Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:13.8854976Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:13.8855061Z #define __k8 1 2025-05-07T20:27:13.8855256Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:13.8855422Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:13.8855534Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:13.8855635Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:13.8855728Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:13.8855822Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:13.8855919Z #define __blksize_t_defined 2025-05-07T20:27:13.8856014Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:13.8856108Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:13.8856224Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:13.8856313Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:13.8856421Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:13.8856512Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:13.8856604Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:13.8856863Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:13.8857206Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:13.8857302Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:13.8857404Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:13.8857484Z #define SEEK_SET 0 2025-05-07T20:27:13.8857579Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:13.8857679Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:27:13.8857870Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:13.8857978Z #define __cudaCDP2GetLastError 2025-05-07T20:27:13.8858070Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:13.8858157Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:13.8858576Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:13.8858846Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:13.8858985Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:13.8859119Z #define __stub_sigreturn 2025-05-07T20:27:13.8859444Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:13.8859540Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:13.8859637Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:13.8859734Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:13.8859826Z #define CLOCK_TAI 11 2025-05-07T20:27:13.8859927Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:13.8860212Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:13.8860308Z #define __restrict_arr 2025-05-07T20:27:13.8860417Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:13.8860553Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:13.8861095Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:13.8861281Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:13.8861371Z #define __USE_MISC 1 2025-05-07T20:27:13.8861469Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:13.8861565Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:13.8861654Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:13.8861736Z #define __LDBL_DIG__ 18 2025-05-07T20:27:13.8861829Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:13.8861933Z #define __malloc_and_calloc_defined 2025-05-07T20:27:13.8862027Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:13.8862126Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:13.8862214Z #define __x86_64__ 1 2025-05-07T20:27:13.8862291Z #define _SIZE_T_ 2025-05-07T20:27:13.8863151Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:13.8863256Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:13.8863348Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:13.8863465Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:13.8863579Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:13.8863671Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:13.8863781Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:13.8863902Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:13.8864036Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:13.8864136Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:13.8864586Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:27:13.8864714Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:13.8864854Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:13.8864949Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:13.8865047Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:13.8865131Z #define STA_FLL 0x0008 2025-05-07T20:27:13.8865268Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:13.8865364Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:13.8865480Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8865595Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:13.8865677Z #define __stub_revoke 2025-05-07T20:27:13.8865763Z #define __timer_t_defined 1 2025-05-07T20:27:13.8865898Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:13.8865987Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:13.8866087Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:13.8866193Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:13.8866411Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:13.8866506Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:13.8866619Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:13.8866715Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:13.8866857Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:13.8866952Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:13.8867037Z #define _IO_off_t __off_t 2025-05-07T20:27:13.8867129Z #define __FLT64_DIG__ 15 2025-05-07T20:27:13.8867348Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:13.8867515Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:13.8867778Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8867899Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:13.8867991Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:13.8868095Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:13.8868177Z #define NULL __null 2025-05-07T20:27:13.8868308Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:13.8868412Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:13.8868506Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:13.8868604Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8868695Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:13.8868771Z #define FP_ZERO 2 2025-05-07T20:27:13.8868866Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:13.8869015Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:13.8869117Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8869203Z #define __WCHAR_T__ 2025-05-07T20:27:13.8869299Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:13.8869487Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:13.8869639Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:13.8869731Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:13.8869852Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:13.8869967Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.8870091Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:13.8870220Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:13.8870308Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:13.8870395Z #define _SIGSET_H_types 1 2025-05-07T20:27:13.8870528Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:13.8870732Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:13.8870907Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:13.8871072Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:13.8871214Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:13.8879918Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:13.8880042Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:13.8880170Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:13.8880279Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:27:13.8880476Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:13.8880574Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:13.8880679Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:13.8880785Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:13.8880873Z #define STA_MODE 0x4000 2025-05-07T20:27:13.8880981Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:13.8881097Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:13.8881210Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:13.8881318Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:13.8881416Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:13.8881521Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:13.8881626Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:13.8881739Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:13.8881828Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:13.8881949Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8882029Z #define __SEG_FS 1 2025-05-07T20:27:13.8882260Z #define _IO_size_t size_t 2025-05-07T20:27:13.8882360Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:13.8882456Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:13.8882540Z #define __stub_lchmod 2025-05-07T20:27:13.8882638Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:13.8882742Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8882847Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:13.8882926Z #define __SEG_GS 1 2025-05-07T20:27:13.8883108Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:13.8883199Z #define _IOS_APPEND 8 2025-05-07T20:27:13.8883433Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:13.8883525Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:13.8883626Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:13.8883720Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:13.8883815Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:13.8883905Z #define htole16(x) (x) 2025-05-07T20:27:13.8884017Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:13.8884108Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:13.8884207Z #define __INT16_TYPE__ short int 2025-05-07T20:27:13.8884308Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:13.8884415Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:13.8884521Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:13.8884642Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:13.8884734Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:13.8884821Z #define __WCLONE 0x80000000 2025-05-07T20:27:13.8884911Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:13.8885006Z #define SEEK_HOLE 4 2025-05-07T20:27:13.8885091Z #define TIMER_ABSTIME 1 2025-05-07T20:27:13.8885181Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:13.8885274Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:13.8885455Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:13.8885597Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8885735Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:27:13.8885874Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:13.8886009Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8886138Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:13.8886223Z #define _LINUX_LIMITS_H 2025-05-07T20:27:13.8886311Z #define linux 1 2025-05-07T20:27:13.8886397Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:13.8886504Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:13.8886604Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:13.8886696Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:13.8886804Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:13.8886951Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:13.8887045Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:13.8887145Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8887239Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:13.8887326Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:13.8887418Z #define htole64(x) (x) 2025-05-07T20:27:13.8887514Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:13.8887636Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:13.8887734Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:13.8888381Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:13.8888506Z #define __USE_POSIX2 1 2025-05-07T20:27:13.8888659Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:13.8888786Z #define __WALL 0x40000000 2025-05-07T20:27:13.8888897Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:13.8888991Z #define _XLOCALE_H 1 2025-05-07T20:27:13.8889084Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:13.8889185Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:13.8889278Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8889381Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:13.8889475Z #define __EXCEPTIONS 1 2025-05-07T20:27:13.8889574Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:13.8889869Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:13.8889960Z #define __WORDSIZE 64 2025-05-07T20:27:13.8890049Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:13.8890135Z #define _STL_RELOPS_H 1 2025-05-07T20:27:13.8890234Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:13.8890328Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:13.8890424Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:13.8890528Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:13.8890625Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:13.8891012Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:13.8891243Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:13.8891370Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:13.8891474Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:13.8891574Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:13.8891687Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:13.8891791Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:13.8891894Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:13.8892079Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:13.8892175Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:13.8892264Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:13.8892371Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:13.8892542Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:13.8892653Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:13.8892748Z #define _STRING_H 1 2025-05-07T20:27:13.8892844Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:13.8892931Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:13.8893036Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:13.8893167Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:13.8893258Z #define __code_model_small__ 1 2025-05-07T20:27:13.8893357Z #define _PSTL_CONFIG_H 2025-05-07T20:27:13.8893454Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:13.8893575Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:13.8893665Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:13.8893763Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:13.8894097Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:13.8894188Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:13.8894270Z #define le64toh(x) (x) 2025-05-07T20:27:13.8894367Z #define FILENAME_MAX 4096 2025-05-07T20:27:13.8894515Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:13.8894625Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:13.8894712Z #define L_cuserid 9 2025-05-07T20:27:13.8894797Z #define __ino_t_defined 2025-05-07T20:27:13.8894880Z #define __k8__ 1 2025-05-07T20:27:13.8894976Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:13.8895079Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:13.8895178Z #define __int8_t_defined 2025-05-07T20:27:13.8895266Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:13.8895361Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8895477Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:13.8895570Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:13.8895683Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:13.8895834Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:13.8895919Z #define __HAVE_COLUMN 2025-05-07T20:27:13.8896001Z #define __stub_fdetach 2025-05-07T20:27:13.8896407Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:13.8896486Z #define __pic__ 2 2025-05-07T20:27:13.8896615Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8896709Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:13.8896797Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:13.8896988Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:13.8897070Z #define __stub_chflags 2025-05-07T20:27:13.8897154Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:13.8897247Z #define __need_IOV_MAX 2025-05-07T20:27:13.8897351Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:13.8897449Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:13.8897550Z #define __cpp_decltype 200707L 2025-05-07T20:27:13.8897644Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:13.8897732Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:13.8897840Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:13.8897995Z #define TTY_NAME_MAX 32 2025-05-07T20:27:13.8898165Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:13.8898296Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8898482Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:13.8898607Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:13.8898703Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:13.8898792Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:13.8898878Z #define __import__ 2025-05-07T20:27:13.8898963Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:13.8899094Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:13.8899183Z #define __export__ 2025-05-07T20:27:13.8899297Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:13.8899394Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:13.8899556Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:13.8899649Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:13.8899747Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:13.8899838Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:13.8899925Z #define _WCHAR_T_DECLARED 2025-05-07T20:27:13.8900050Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:13.8900162Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:13.8900265Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:13.8900363Z #define WNOWAIT 0x01000000 2025-05-07T20:27:13.8900444Z #define PLOSS 6 2025-05-07T20:27:13.8900532Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:13.8900791Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:13.8900875Z #define EXIT_SUCCESS 0 2025-05-07T20:27:13.8900972Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:13.8901063Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:13.8901160Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:13.8901253Z #define __thread__ __thread 2025-05-07T20:27:13.8901345Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:13.8901437Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:13.8901540Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:13.8901759Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:13.8901868Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:13.8901963Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:13.8902044Z #define __linux__ 1 2025-05-07T20:27:13.8902136Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:13.8902265Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:13.8902352Z #define __S16_TYPE short int 2025-05-07T20:27:13.8902693Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:13.8902793Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:13.8902976Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:13.8903075Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:13.8903172Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:13.8903252Z #define _T_SIZE_ 2025-05-07T20:27:13.8903351Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:13.8903466Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:13.8903556Z #define _PSTL_VERSION 12000 2025-05-07T20:27:13.8903677Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:13.8903768Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:13.8903959Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:13.8904084Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:13.8904164Z #define _IOS_INPUT 1 2025-05-07T20:27:13.8904258Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:13.8904357Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:13.8904444Z #define __INT64_TYPE__ long int 2025-05-07T20:27:13.8904541Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:13.8904635Z #define __shared__ __location__(shared) 2025-05-07T20:27:13.8904722Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:13.8904958Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:13.8905045Z #define __gid_t_defined 2025-05-07T20:27:13.8905160Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:13.8905252Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:13.8905448Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:13.8905548Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:13.8905638Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:13.8905722Z #define ___int_size_t_h 2025-05-07T20:27:13.8905826Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8905946Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:27:13.8906094Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:13.8906199Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:13.8906289Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:13.8906379Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:13.8906475Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:13.8906599Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8906715Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:13.8906829Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:13.8906917Z #define __clock_t_defined 1 2025-05-07T20:27:13.8907018Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:13.8907124Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:13.8907215Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:13.8907308Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:13.8907402Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:13.8907505Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:13.8907736Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:13.8907906Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:13.8907989Z #define __SSE__ 1 2025-05-07T20:27:13.8908081Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:13.8908171Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:13.8908257Z #define _CTYPE_H 1 2025-05-07T20:27:13.8908351Z #define __sigset_t_defined 2025-05-07T20:27:13.8908443Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:13.8908537Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:13.8908620Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:13.8908714Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:13.8908809Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:13.8908889Z #define __SM_70_RT_H__ 2025-05-07T20:27:13.8908985Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:13.8909092Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:13.8909183Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:13.8909343Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:13.8909434Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:13.8909537Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:13.8909633Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:13.8909720Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:13.8909800Z #define __amd64__ 1 2025-05-07T20:27:13.8909893Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:13.8909999Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:13.8910259Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.8910361Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:13.8910443Z #define EOF (-1) 2025-05-07T20:27:13.8910535Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:13.8910717Z #define __USE_POSIX199309 1 2025-05-07T20:27:13.8910808Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:13.8910905Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:13.8910995Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:13.8911088Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:13.8911203Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:13.8911293Z #define ____mbstate_t_defined 1 2025-05-07T20:27:13.8911376Z #define STA_NANO 0x2000 2025-05-07T20:27:13.8911473Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:13.8911564Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:13.8911646Z #define _IO_LINKED 0x80 2025-05-07T20:27:13.8911851Z #define __cpp_lib_launder 201606 2025-05-07T20:27:13.8911942Z #define __SIZEOF_INT128__ 16 2025-05-07T20:27:13.8912040Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:13.8912135Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:13.8912224Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:13.8912366Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:13.8912475Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8912570Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:13.8912668Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:13.8912757Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:13.8912843Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:13.8912974Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:13.8913092Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8913289Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:13.8913475Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:13.8913562Z #define __stub_stty 2025-05-07T20:27:13.8913728Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:13.8913812Z #define le16toh(x) (x) 2025-05-07T20:27:13.8913914Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:13.8914088Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:13.8914172Z #define _SIZET_ 2025-05-07T20:27:13.8914258Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:13.8914345Z #define _SVID_SOURCE 1 2025-05-07T20:27:13.8914423Z #define _LP64 1 2025-05-07T20:27:13.8914510Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:13.8914747Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:13.8914854Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:13.8914942Z #define __UINT8_C(c) c 2025-05-07T20:27:13.8915032Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:13.8915121Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:13.8915236Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:13.8915327Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:13.8915417Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:13.8915514Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:13.8915599Z #define CUDARTAPI 2025-05-07T20:27:13.8915679Z #define IOV_MAX 1024 2025-05-07T20:27:13.8915823Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:13.8915919Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:13.8916015Z #define P_tmpdir "/tmp" 2025-05-07T20:27:13.8916118Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:13.8916199Z #define __wchar_t__ 2025-05-07T20:27:13.8916305Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:13.8916382Z #define SEEK_END 2 2025-05-07T20:27:13.8916469Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:13.8916643Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:13.8916737Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:13.8916876Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:13.8916971Z #define ____FILE_defined 1 2025-05-07T20:27:13.8917083Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:13.8917176Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:13.8917265Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:13.8917357Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:13.8917596Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.8917812Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:13.8917891Z #define _IO_RIGHT 04 2025-05-07T20:27:13.8917987Z #define __END_NAMESPACE_STD 2025-05-07T20:27:13.8918172Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:13.8918259Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:13.8918379Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:13.8918468Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:13.8918564Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:13.8918653Z #define _STDDEF_H_ 2025-05-07T20:27:13.8918962Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:13.8919103Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8919230Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:13.8919422Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:13.8919539Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8919682Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:13.8919799Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:13.8919902Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:13.8920009Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:13.8920101Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8920217Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:13.8920310Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:13.8920404Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:13.8920501Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:13.8920678Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:13.8920775Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:13.8920955Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:13.8921051Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:13.8921150Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:13.8921293Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:13.8921384Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:13.8921481Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:13.8921578Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:13.8921692Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:13.8921786Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:13.8921881Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:13.8922049Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:13.8922212Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:13.8922312Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:13.8922433Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:13.8922540Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:13.8922637Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:13.8922868Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:13.8922967Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:13.8923080Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:13.8923197Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:13.8923286Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:13.8923377Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:13.8923475Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:13.8923568Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:13.8923646Z #define __FXSR__ 1 2025-05-07T20:27:13.8923732Z #define _SIZE_T 2025-05-07T20:27:13.8923829Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:13.8923943Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:13.8924113Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.8924258Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:13.8924349Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:13.8924455Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:13.8924636Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:13.8924934Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:13.8925021Z #define _GXX_NULLPTR_T 2025-05-07T20:27:13.8925142Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:13.8925230Z #define FOPEN_MAX 16 2025-05-07T20:27:13.8925315Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:13.8925429Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:13.8925529Z #define __suseconds_t_defined 2025-05-07T20:27:13.8925613Z #define __off_t_defined 2025-05-07T20:27:13.8925697Z #define stderr stderr 2025-05-07T20:27:13.8925870Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:13.8925980Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:13.8926081Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:13.8926169Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:13.8926573Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:13.8926674Z #define __mode_t_defined 2025-05-07T20:27:13.8926758Z #define _GCC_SIZE_T 2025-05-07T20:27:13.8926854Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8926958Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:13.8927060Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:13.8927151Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:13.8927246Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:13.8927346Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:13.8927446Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:13.8927552Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:13.8927642Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:13.8927727Z #define __size_t__ 2025-05-07T20:27:13.8927853Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:13.8927944Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:13.8928052Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:13.8928199Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:13.8928297Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:13.8928490Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:13.8928593Z #define _ENDIAN_H 1 2025-05-07T20:27:13.8928697Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:13.8928794Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:13.8928889Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:13.8928974Z #define __try try 2025-05-07T20:27:13.8929070Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:13.8929158Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:13.8929251Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:13.8929505Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:13.8929591Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:13.8929675Z #define __PIC__ 2 2025-05-07T20:27:13.8929782Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:13.8929897Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:13.8930035Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:13.8930126Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:13.8930216Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:13.8930405Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.8930502Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8930601Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:13.8930688Z #define _IO_uid_t __uid_t 2025-05-07T20:27:13.8930781Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:13.8930909Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:13.8931002Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:13.8931234Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:13.8931333Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:13.8931455Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:13.8931537Z #define LONG_BIT 64 2025-05-07T20:27:13.8931641Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:13.8931831Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:13.8931958Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:13.8932052Z #define __fsfilcnt_t_defined 2025-05-07T20:27:13.8932145Z #define __blkcnt_t_defined 2025-05-07T20:27:13.8932414Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:13.8932507Z #define __USE_LARGEFILE 1 2025-05-07T20:27:13.8932602Z #define __cpp_constexpr 201603L 2025-05-07T20:27:13.8932695Z #define CUDART_VERSION 12080 2025-05-07T20:27:13.8932789Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:13.8932889Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:13.8933052Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:13.8933252Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:13.8933342Z #define __lldiv_t_defined 1 2025-05-07T20:27:13.8933422Z #define __SSE2__ 1 2025-05-07T20:27:13.8933505Z #define _IOLBF 1 2025-05-07T20:27:13.8933603Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:13.8933702Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:13.8933808Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:13.8933899Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:13.8934013Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:13.8934100Z #define __INT32_TYPE__ int 2025-05-07T20:27:13.8934188Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:13.8934298Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:13.8934396Z #define __cpp_exceptions 199711L 2025-05-07T20:27:13.8934487Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:13.8934602Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:13.8934699Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:13.8934816Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:13.8934979Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:13.8935075Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:13.8935180Z #define __SWORD_TYPE long int 2025-05-07T20:27:13.8935277Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:13.8935375Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:13.8935474Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:13.8935563Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:13.8935844Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.8935943Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:13.8936088Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:13.8936169Z #define _T_SIZE 2025-05-07T20:27:13.8936281Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:13.8936403Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:13.8936530Z [... long run of compiler predefined-macro output elided: the step dumps every macro defined by the nvcc/g++ toolchain and the CUDA, glibc, and libstdc++ headers (e.g. __CUDACC__ 1, __NVCC__ 1, __GNUC_MINOR__ 4, _GLIBCXX_USE_INT128 1) as part of verifying the freshly installed CUDA 12.8 toolchain ...]
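A dump like the one elided above can be reproduced outside CI with the host compiler's preprocessor; a minimal sketch (an assumed invocation, not necessarily the exact command used by setup_env.bash):

    # Print the macros the C++ toolchain predefines. Preprocessing a .cu
    # file through nvcc instead would additionally pull in the CUDA macros
    # (__CUDACC__, __NVCC__, ...) seen in the dump above.
    echo | g++ -dM -E -x c++ - | sort | head -n 20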
2025-05-07T20:27:13.9163909Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:15.8089610Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:15.8089968Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:15.8090268Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:15.8090563Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:15.8090893Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:15.8716686Z /usr/bin/nvidia-smi
2025-05-07T20:27:15.8722492Z + nvidia-smi
2025-05-07T20:27:15.8892435Z Wed May  7 20:27:15 2025
2025-05-07T20:27:15.8892898Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:15.8893393Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:15.8893899Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:15.8894372Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:15.8894886Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:15.8895300Z |                                         |                        |               MIG M. |
2025-05-07T20:27:15.8895626Z |=========================================+========================+======================|
2025-05-07T20:27:15.9072505Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:15.9073954Z |  0%   25C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:27:15.9075014Z |                                         |                        |                  N/A |
2025-05-07T20:27:15.9076078Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:15.9077572Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:15.9080982Z | Processes:                                                                              |
2025-05-07T20:27:15.9081581Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:15.9082149Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:15.9082615Z |=========================================================================================|
2025-05-07T20:27:15.9083184Z |  No running processes found                                                             |
2025-05-07T20:27:15.9083806Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.1640532Z [INSTALL] Successfully installed CUDA 12.8.0
2025-05-07T20:27:16.1695522Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.1696059Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.1708661Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:16.1709009Z env:
2025-05-07T20:27:16.1709255Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:16.1709591Z   BUILD_ENV: build_binary
2025-05-07T20:27:16.1709842Z   BUILD_TARGET: genai
2025-05-07T20:27:16.1710075Z   BUILD_VARIANT: cuda
2025-05-07T20:27:16.1710306Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:27:16.1710563Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:16.1710867Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:16.1711197Z ##[endgroup]
2025-05-07T20:27:16.5076311Z ################################################################################
2025-05-07T20:27:16.5076661Z # Install PyTorch (PIP)
2025-05-07T20:27:16.5076917Z #
2025-05-07T20:27:16.5092889Z # [2025-05-07T20:27:16.508Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:27:16.5093312Z ################################################################################
2025-05-07T20:27:16.5122076Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:17.4967315Z Channels:
2025-05-07T20:27:17.4967551Z  - conda-forge
2025-05-07T20:27:17.4967765Z Platform: linux-64
2025-05-07T20:27:20.7474283Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:21.4572314Z Solving environment: done
2025-05-07T20:27:21.6755633Z ## Package Plan ##
2025-05-07T20:27:21.6756116Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:21.6756643Z   added / updated specs:
2025-05-07T20:27:21.6756989Z     - numpy
2025-05-07T20:27:21.6757303Z The following packages will be downloaded:
2025-05-07T20:27:21.6757713Z     package                    |            build
2025-05-07T20:27:21.6758030Z     ---------------------------|-----------------
2025-05-07T20:27:21.6758405Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:27:21.6758927Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:27:21.6759536Z     libgfortran-15.1.0         |       h69a702a_2             34 KB  conda-forge
2025-05-07T20:27:21.6760131Z     libgfortran5-15.1.0        |       hcea5267_2            1.5 MB  conda-forge
2025-05-07T20:27:21.6760657Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:27:21.6761116Z     libopenblas-0.3.29         |pthreads_h94d23a6_0           5.6 MB  conda-forge
2025-05-07T20:27:21.6761569Z     numpy-2.2.5                |   py313h17eae1a_0            8.1 MB  conda-forge
2025-05-07T20:27:21.6761950Z     ------------------------------------------------------------
2025-05-07T20:27:21.6762274Z                                            Total:        15.4 MB
2025-05-07T20:27:21.6762602Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:21.6763023Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:21.6763517Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:21.6764009Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:21.6764488Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:21.6764988Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:21.6765514Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:21.6766547Z   numpy         conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:21.6766983Z Downloading and Extracting Packages: ...working...
2025-05-07T20:27:21.6767486Z [... interactive download progress bars and terminal control sequences elided; all seven packages reached 100% ...]
2025-05-07T20:27:22.6256791Z done
2025-05-07T20:27:22.7261903Z Preparing transaction: done
2025-05-07T20:27:22.9268411Z Verifying transaction: done
2025-05-07T20:27:23.0277075Z Executing transaction: done
2025-05-07T20:27:23.2060844Z ################################################################################
2025-05-07T20:27:23.2061420Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:23.2061861Z #
2025-05-07T20:27:23.2076390Z # [2025-05-07T20:27:23.207Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:23.2077149Z ################################################################################
2025-05-07T20:27:23.2092066Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:23.2998465Z [CHECK] Network does not appear to be blocked.
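The "Prepare PIP Arguments" step that follows maps the requested channel and CUDA version onto a PyTorch PIP index URL (nightly + cuda/12.8.0 -> cu128). A minimal sketch of that mapping, with hypothetical variable names (the real logic lives in __prepare_pip_arguments in setup_env.bash):

    # Derive the cu128 variant tag and nightly index URL from "cuda/12.8.0".
    variant_spec="cuda/12.8.0"
    ver="${variant_spec#cuda/}"                        # 12.8.0
    tag="cu$(echo "${ver}" | cut -d. -f1-2 | tr -d .)" # cu128
    echo "https://download.pytorch.org/whl/nightly/${tag}/"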
2025-05-07T20:27:23.2998942Z ################################################################################ 2025-05-07T20:27:23.2999386Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:23.2999671Z # 2025-05-07T20:27:23.3016234Z # [2025-05-07T20:27:23.301Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:27:23.3016807Z ################################################################################ 2025-05-07T20:27:23.3017562Z 2025-05-07T20:27:23.3038628Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:23.3065497Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:27:23.3082822Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:23.3083553Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:27:23.3092400Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:23.3100302Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:27:23.3121230Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:29:00.7023278Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:29:00.7025999Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:29:00.7026461Z 2025-05-07T20:29:00.7026584Z Collecting torch 2025-05-07T20:29:00.7027383Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:29:00.7028534Z Collecting filelock (from torch) 2025-05-07T20:29:00.7029230Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:29:00.7030641Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:29:00.7031933Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:29:00.7032591Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:29:00.7033084Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:29:00.7033918Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 41.8 MB/s eta 0:00:00 2025-05-07T20:29:00.7034266Z Collecting networkx (from torch) 2025-05-07T20:29:00.7034761Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:29:00.7035392Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 18.8 MB/s eta 0:00:00 2025-05-07T20:29:00.7035731Z Collecting jinja2 (from torch) 2025-05-07T20:29:00.7036201Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:29:00.7036693Z Collecting fsspec (from torch) 2025-05-07T20:29:00.7037194Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 
2025-05-07T20:29:00.7037760Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:29:00.7038616Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7039432Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:29:00.7040658Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7041521Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:29:00.7042312Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7043092Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:29:00.7044407Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:29:00.7045124Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:29:00.7045862Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7046551Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:29:00.7047325Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:00.7048135Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:29:00.7048835Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:00.7049544Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:29:00.7050261Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7051096Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:29:00.7051930Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7052724Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:29:00.7053435Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:29:00.7054145Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:29:00.7054896Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:29:00.7055670Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:29:00.7056420Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7057193Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:29:00.7058013Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7058806Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 
2025-05-07T20:29:00.7059623Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:00.7060418Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:29:00.7061244Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7062050Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:29:00.7062604Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:29:00.7063315Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 5.6 MB/s eta 0:00:00 2025-05-07T20:29:00.7063683Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:29:00.7064173Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:29:00.7064652Z Preparing metadata (setup.py): started 2025-05-07T20:29:00.7065027Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:29:00.7065772Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1047.0 MB) 2025-05-07T20:29:00.7066560Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 22.5 MB/s eta 0:00:00 2025-05-07T20:29:00.7067516Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:29:00.7068410Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 53.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7069182Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:29:00.7070026Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 176.3 MB/s eta 0:00:00 2025-05-07T20:29:00.7070789Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:29:00.7071633Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 169.6 MB/s eta 0:00:00 2025-05-07T20:29:00.7072397Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:29:00.7073263Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 103.2 MB/s eta 0:00:00 2025-05-07T20:29:00.7073936Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:29:00.7074692Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 45.3 MB/s eta 0:00:00 2025-05-07T20:29:00.7075450Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:29:00.7076295Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 89.3 MB/s eta 0:00:00 2025-05-07T20:29:00.7077044Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:29:00.7078022Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 66.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7078727Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 
MB) 2025-05-07T20:29:00.7079481Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 148.7 MB/s eta 0:00:00 2025-05-07T20:29:00.7080180Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:29:00.7081066Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 107.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7081845Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T20:29:00.7082695Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 93.1 MB/s eta 0:00:00 2025-05-07T20:29:00.7083382Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:29:00.7084180Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.5 MB/s eta 0:00:00 2025-05-07T20:29:00.7085160Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:29:00.7086003Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 130.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7086764Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T20:29:00.7087602Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 161.2 MB/s eta 0:00:00 2025-05-07T20:29:00.7088338Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:29:00.7089474Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:29:00.7090352Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 125.7 MB/s eta 0:00:00 2025-05-07T20:29:00.7090731Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:29:00.7091116Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:29:00.7091551Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:29:00.7092396Z Created wheel for MarkupSafe: filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=c651adbbf11229a5595504d32ca1e5d9b02f5c896a75bb208e770b56236dac00 2025-05-07T20:29:00.7093410Z Stored in directory: /home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6 2025-05-07T20:29:00.7094009Z Successfully built MarkupSafe 2025-05-07T20:29:00.7095700Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:29:00.7097278Z 2025-05-07T20:29:00.7099218Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 
nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:29:00.7101233Z 2025-05-07T20:29:02.9352652Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:29:02.9354666Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:29:06.4122360Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:29:09.9015744Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:09.9016168Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:29:13.2980487Z True 2025-05-07T20:29:13.2980715Z True 2025-05-07T20:29:13.2980822Z 2025-05-07T20:29:13.3599066Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:29:13.3646109Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:13.3646722Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:13.3660989Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:13.3661330Z env: 2025-05-07T20:29:13.3661553Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:13.3661849Z BUILD_ENV: build_binary 2025-05-07T20:29:13.3662092Z BUILD_TARGET: genai 2025-05-07T20:29:13.3662501Z BUILD_VARIANT: cuda 2025-05-07T20:29:13.3662735Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:13.3662987Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:13.3663285Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:13.3663620Z ##[endgroup] 2025-05-07T20:29:13.6995917Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:29:13.6997582Z ################################################################################ 2025-05-07T20:29:13.6998208Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:29:13.6998577Z # 2025-05-07T20:29:13.7013320Z # [2025-05-07T20:29:13.701Z] + collect_pytorch_env_info build_binary 2025-05-07T20:29:13.7013815Z ################################################################################ 2025-05-07T20:29:13.7014132Z 2025-05-07T20:29:13.7028664Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:13.7955144Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:13.7965465Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:29:13.7966357Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:29:13.7966866Z 2025-05-07T20:29:13.8853531Z 2025-05-07T20:29:13.8854037Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:29:13.8878174Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:29:19.8321706Z Collecting environment information... 
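The variant and ABI checks logged just above (correct variant cu128, _GLIBCXX_USE_CXX11_ABI True) reduce to a quick probe of the installed wheel; a minimal sketch (an assumed form; the actual checks are implemented in setup_env.bash):

    # Confirm the nightly wheel is the cu128 variant and was built with
    # the CXX11 ABI, matching the True output logged above.
    conda run -n build_binary python -c "import torch; assert torch.__version__.endswith('+cu128'), 'wrong CUDA variant'; print(torch.__version__, torch.compiled_with_cxx11_abi())"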
2025-05-07T20:29:19.8322246Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:29:19.8322654Z Is debug build: False 2025-05-07T20:29:19.8322925Z CUDA used to build PyTorch: 12.8 2025-05-07T20:29:19.8323197Z ROCM used to build PyTorch: N/A 2025-05-07T20:29:19.8323374Z 2025-05-07T20:29:19.8323474Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:29:19.8323784Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:29:19.8324093Z Clang version: Could not collect 2025-05-07T20:29:19.8324369Z CMake version: Could not collect 2025-05-07T20:29:19.8324628Z Libc version: glibc-2.34 2025-05-07T20:29:19.8324778Z 2025-05-07T20:29:19.8325082Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:29:19.8325678Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:29:19.8326081Z Is CUDA available: True 2025-05-07T20:29:19.8326340Z CUDA runtime version: 12.8.61 2025-05-07T20:29:19.8326599Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:29:19.8326896Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:29:19.8327228Z Nvidia driver version: 570.133.07 2025-05-07T20:29:19.8327493Z cuDNN version: Could not collect 2025-05-07T20:29:19.8327754Z HIP runtime version: N/A 2025-05-07T20:29:19.8327999Z MIOpen runtime version: N/A 2025-05-07T20:29:19.8328252Z Is XNNPACK available: True 2025-05-07T20:29:19.8328421Z 2025-05-07T20:29:19.8328495Z CPU: 2025-05-07T20:29:19.8328712Z Architecture: x86_64 2025-05-07T20:29:19.8329040Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:29:19.8329422Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:29:19.8329799Z Byte Order: Little Endian 2025-05-07T20:29:19.8330113Z CPU(s): 16 2025-05-07T20:29:19.8330392Z On-line CPU(s) list: 0-15 2025-05-07T20:29:19.8331037Z Vendor ID: AuthenticAMD 2025-05-07T20:29:19.8331384Z Model name: AMD EPYC 7R32 2025-05-07T20:29:19.8331689Z CPU family: 23 2025-05-07T20:29:19.8331965Z Model: 49 2025-05-07T20:29:19.8332245Z Thread(s) per core: 2 2025-05-07T20:29:19.8332518Z Core(s) per socket: 8 2025-05-07T20:29:19.8332791Z Socket(s): 1 2025-05-07T20:29:19.8333061Z Stepping: 0 2025-05-07T20:29:19.8333499Z BogoMIPS: 5598.98 2025-05-07T20:29:19.8335523Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:29:19.8339693Z Hypervisor vendor: KVM 2025-05-07T20:29:19.8340003Z Virtualization type: full 2025-05-07T20:29:19.8340819Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:29:19.8341278Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:29:19.8341639Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:29:19.8341988Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:29:19.8342303Z NUMA node(s): 1 2025-05-07T20:29:19.8342585Z NUMA node0 CPU(s): 0-15 2025-05-07T20:29:19.8342931Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:29:19.8343292Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:29:19.8343645Z Vulnerability L1tf: Not affected 2025-05-07T20:29:19.8343995Z Vulnerability 
Mds: Not affected 2025-05-07T20:29:19.8344334Z Vulnerability Meltdown: Not affected 2025-05-07T20:29:19.8344682Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:29:19.8345040Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:29:19.8345565Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:29:19.8346139Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:29:19.8346670Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:29:19.8347350Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:29:19.8348297Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:29:19.8348957Z Vulnerability Srbds: Not affected 2025-05-07T20:29:19.8349313Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:29:19.8349537Z 2025-05-07T20:29:19.8349644Z Versions of relevant libraries: 2025-05-07T20:29:19.8349902Z [pip3] numpy==2.2.5 2025-05-07T20:29:19.8350144Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:29:19.8350453Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:29:19.8350749Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:29:19.8351057Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:29:19.8351365Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:29:19.8351640Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:29:19.8351925Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:29:19.8352218Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:29:19.8352505Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:29:19.8353029Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:29:19.8353327Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:29:19.8353609Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:29:19.8353894Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:29:19.8354173Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:29:19.8354465Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:29:19.8354827Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:29:19.8355299Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:29:19.8355924Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:29:19.8356424Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:29:19.8356938Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:29:19.8357459Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:29:19.8357932Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8358376Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:29:19.8358837Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:29:19.8359312Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:29:19.8359773Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8360222Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:29:19.8360662Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8361100Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8361551Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:29:19.8362017Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:29:19.8362459Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:29:19.8362910Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:29:19.8363352Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8363795Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:29:19.8364245Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8364694Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:29:19.8365152Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:29:19.8365618Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:29:19.8366086Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8366543Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:29:19.8367013Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8367482Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:29:19.8367926Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:29:19.8368381Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:29:19.8368914Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:29:19.8369392Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:29:19.8369886Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:29:19.8370360Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:29:19.8370929Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:29:19.8371391Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:29:19.8371852Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:29:19.8372321Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:29:19.8372798Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:29:19.8373265Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:29:19.8373812Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:29:19.8374271Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:29:19.8374728Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:29:19.8375167Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:29:19.8375436Z 2025-05-07T20:29:19.9076737Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:19.9077400Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:19.9090023Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:19.9090364Z env: 2025-05-07T20:29:19.9090583Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:19.9090866Z BUILD_ENV: build_binary 2025-05-07T20:29:19.9091105Z BUILD_TARGET: genai 2025-05-07T20:29:19.9091332Z BUILD_VARIANT: cuda 2025-05-07T20:29:19.9091580Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:19.9091826Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:19.9092123Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:19.9092454Z ##[endgroup] 2025-05-07T20:29:20.2483984Z ################################################################################ 2025-05-07T20:29:20.2484474Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:20.2484791Z # 2025-05-07T20:29:20.2499711Z # [2025-05-07T20:29:20.249Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:20.2500252Z ################################################################################ 2025-05-07T20:29:20.2500540Z 2025-05-07T20:29:20.2515190Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:20.3547675Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:20.3567407Z [BUILD] Running git submodules update ... 2025-05-07T20:29:20.3589023Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:20.3953751Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:20.3954213Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:20.3954653Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:20.3955036Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:20.3955435Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:20.3955875Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:20.3956285Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:20.3989463Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:20.4538848Z [BUILD] Installing other build dependencies ... 
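Every external command in this job goes through a retry wrapper, visible as the [EXEC] [ATTEMPT 0/3] prefix on each invocation. A hypothetical sketch of such a helper (name and back-off assumed; the real one lives in setup_env.bash):

    # Run a command, retrying up to 3 more times on failure.
    exec_with_retries () {
      local max_retries=3 attempt
      for ((attempt = 0; attempt <= max_retries; attempt++)); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2  # brief back-off between attempts
      done
      return 1
    }
    exec_with_retries conda run --no-capture-output -n build_binary \
      python -m pip install -r requirements.txt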
2025-05-07T20:29:20.4560819Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:22.8631458Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:22.8805037Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:22.9721894Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:23.0200511Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:23.2145868Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:23.2176229Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:23.3178564Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:23.3203443Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:23.6186010Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:23.6218664Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:23.6782659Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:23.6786379Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:23.7466710Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:23.7495315Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:23.7999781Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:23.8558934Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:23.8614441Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:23.9780353Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:23.9806605Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:24.0836835Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:24.0868438Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:24.1294995Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:24.1900697Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:24.1925793Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:24.2792186Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:24.2824154Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:24.3843008Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:24.3880959Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:24.4961048Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:24.4994155Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:24.5935970Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:24.5963423Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:24.7012026Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:24.7037379Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:24.8136706Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:24.8172229Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:24.8691747Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:24.9145938Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:24.9195393Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:24.9666020Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:25.0150977Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:25.0178224Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:25.0655210Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:25.1261246Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:25.1290324Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:25.1754641Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:25.2249842Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:25.2780044Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:25.7938162Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 54.0 MB/s eta 0:00:00 2025-05-07T20:29:25.7968639Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:25.8509437Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:25.9108258Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:25.9603058Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:26.0142119Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:26.0655155Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:29:26.1216444Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 9.2 MB/s eta 0:00:00 2025-05-07T20:29:26.1244281Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:26.1726754Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:26.2248736Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:26.2760802Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:26.3368300Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:26.3851413Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:26.4283576Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:26.4823130Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:26.5311749Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:26.5805670Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:26.7467840Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:28.9943768Z 2025-05-07T20:29:28.9994395Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:29.1737657Z ################################################################################ 2025-05-07T20:29:29.1738005Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:29.1738420Z # 2025-05-07T20:29:29.1755763Z # [2025-05-07T20:29:29.175Z] + install_triton_pip build_binary 2025-05-07T20:29:29.1756144Z ################################################################################ 2025-05-07T20:29:29.1756384Z 2025-05-07T20:29:29.1756605Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:29.1757036Z ################################################################################ 2025-05-07T20:29:29.1757386Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:29.1757687Z # 2025-05-07T20:29:29.1772386Z # [2025-05-07T20:29:29.176Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:29.1772899Z ################################################################################ 2025-05-07T20:29:29.1773108Z 2025-05-07T20:29:29.1787885Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:29.2680092Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:29.2680427Z ################################################################################ 2025-05-07T20:29:29.2680753Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:29.2681268Z # 2025-05-07T20:29:29.2700589Z # [2025-05-07T20:29:29.269Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:29.2701228Z ################################################################################ 2025-05-07T20:29:29.2701508Z 2025-05-07T20:29:29.2750584Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:29.2767238Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:29.2767806Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:29.2776059Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:29.2785488Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:29.2806690Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:36.8607823Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:36.8609062Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:36.8609730Z 2025-05-07T20:29:36.8609958Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:36.8610361Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:36.8611188Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:36.8612410Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:36.8613504Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 58.2 MB/s eta 0:00:00 2025-05-07T20:29:36.8613902Z Installing collected packages: pytorch-triton 2025-05-07T20:29:36.8614246Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:36.8614629Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:36.8615045Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:36.8615479Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:36.8615920Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:36.8616170Z 2025-05-07T20:29:39.1002995Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:39.1006892Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:41.2603730Z ################################################################################ 2025-05-07T20:29:41.2604310Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:41.2604838Z ################################################################################ 2025-05-07T20:29:41.2605127Z 2025-05-07T20:29:43.3325487Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:45.5033532Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:45.5037598Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:45.5083171Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:45.5083639Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:45.5098022Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:45.5098368Z env: 2025-05-07T20:29:45.5098589Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:45.5098878Z BUILD_ENV: build_binary 2025-05-07T20:29:45.5099120Z BUILD_TARGET: genai 2025-05-07T20:29:45.5099341Z BUILD_VARIANT: cuda 2025-05-07T20:29:45.5099563Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:45.5099814Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:45.5100355Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:45.5100686Z ##[endgroup] 2025-05-07T20:29:45.8485840Z ################################################################################ 2025-05-07T20:29:45.8486331Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:45.8486691Z # 2025-05-07T20:29:45.8502465Z # [2025-05-07T20:29:45.849Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8503126Z ################################################################################ 2025-05-07T20:29:45.8503344Z 2025-05-07T20:29:45.8503720Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8504481Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8504816Z 2025-05-07T20:29:45.8662877Z e6e36b113f85d3aaa465a028688a068480db398f fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8665322Z 2025-05-07T20:29:45.8665932Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8666285Z 2025-05-07T20:29:45.8851444Z ad0b4412d9939ed191fe39ed235330a3031fd537afb4dd426cb9ce0834b66e07 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8854031Z 2025-05-07T20:29:45.8854347Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8854681Z 2025-05-07T20:29:45.9183654Z 74a01928743b9ea024408833cc9e2c10 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.9187063Z 2025-05-07T20:29:45.9198532Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:45.9220147Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:48.7133749Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:48.7134716Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:48.7135598Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:48.7136033Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:48.7136293Z 2025-05-07T20:29:55.6313119Z ################################################################################ 2025-05-07T20:29:55.6313538Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:55.6313910Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128
2025-05-07T20:29:55.6314323Z [CHECK] CUDA version reported by PyTorch is: 12.8
2025-05-07T20:29:55.6314640Z [CHECK]
2025-05-07T20:29:55.6314955Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:29:55.6315456Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:29:55.6315869Z ################################################################################
2025-05-07T20:29:55.6316089Z
2025-05-07T20:29:55.6316203Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:29:59.6502215Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:03.6490469Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:07.6389346Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:07.6393026Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:30:19.6441718Z ################################################################################
2025-05-07T20:30:19.6442297Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:30:19.6442655Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:30:19.6442989Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:30:19.6443316Z ################################################################################
2025-05-07T20:30:19.6444015Z
2025-05-07T20:30:27.6281736Z ################################################################################
2025-05-07T20:30:27.6282332Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:30:27.6283731Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:30:27.6285306Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:30:27.6285838Z ################################################################################
2025-05-07T20:30:27.6286051Z
2025-05-07T20:30:27.6286223Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:30:31.6590486Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:30:35.6533067Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:30:39.7828703Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:30:43.7996516Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:30:43.8000973Z [INSTALL] Check for operator registrations ...
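(Aside: the operator-registration checks that follow can be reproduced by hand. The snippet below is a minimal sketch assuming the wheel is installed in the active Python environment; the op_is_registered helper is illustrative and not part of the CI scripts.)

    import torch
    import fbgemm_gpu  # importing the package loads the FBGEMM operator libraries

    def op_is_registered(qualname: str) -> bool:
        # qualname is e.g. "fbgemm.nccl_init"; torch.ops resolves operators
        # lazily, so a hasattr() probe is enough to test for registration.
        namespace, name = qualname.split(".")
        return hasattr(getattr(torch.ops, namespace), name)

    for op in ("fbgemm.nccl_init", "fbgemm.gqa_attn_splitk", "fbgemm.rope_qkv_decoding"):
        print(op, op_is_registered(op))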
2025-05-07T20:30:47.7104052Z fbgemm.nccl_init 2025-05-07T20:30:47.7106070Z 2025-05-07T20:30:47.7725546Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:51.6801900Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:51.6802115Z 2025-05-07T20:30:51.7416790Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:55.6618710Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:55.6618925Z 2025-05-07T20:30:55.7239131Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:55.7239721Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:55.7278263Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:55.7278706Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:55.7292524Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:55.7292870Z env: 2025-05-07T20:30:55.7293100Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:55.7293394Z BUILD_ENV: build_binary 2025-05-07T20:30:55.7293647Z BUILD_TARGET: genai 2025-05-07T20:30:55.7293876Z BUILD_VARIANT: cuda 2025-05-07T20:30:55.7294105Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:55.7294367Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:55.7294674Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:55.7294998Z ##[endgroup] 2025-05-07T20:30:56.0639613Z ################################################################################ 2025-05-07T20:30:56.0640008Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:56.0640542Z # 2025-05-07T20:30:56.0657218Z # [2025-05-07T20:30:56.065Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:56.0657623Z ################################################################################ 2025-05-07T20:30:56.0657831Z 2025-05-07T20:31:04.0404547Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:31:04.0405091Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:31:04.0405482Z [TEST] Determined the test directories: 2025-05-07T20:31:04.0405796Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:31:04.0406094Z fbgemm_gpu/experimental/example/test 2025-05-07T20:31:04.0406380Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:31:04.0406566Z 2025-05-07T20:31:04.0416480Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:31:04.0423123Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:31:04.0423990Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:31:04.0424279Z 2025-05-07T20:31:04.4634126Z 2025-05-07T20:31:04.4634452Z [TEST] Installing PyTest ... 
2025-05-07T20:31:04.4658858Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:31:05.5654506Z Channels:
2025-05-07T20:31:05.5654796Z  - conda-forge
2025-05-07T20:31:05.5655018Z Platform: linux-64
2025-05-07T20:31:08.8996716Z Collecting package metadata (repodata.json): done
2025-05-07T20:31:10.0496087Z Solving environment: done
2025-05-07T20:31:10.2780583Z
2025-05-07T20:31:10.2781094Z ## Package Plan ##
2025-05-07T20:31:10.2781342Z
2025-05-07T20:31:10.2781646Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:31:10.2782074Z
2025-05-07T20:31:10.2782176Z   added / updated specs:
2025-05-07T20:31:10.2782413Z     - expecttest
2025-05-07T20:31:10.2782652Z     - pytest
2025-05-07T20:31:10.2782768Z
2025-05-07T20:31:10.2782897Z The following packages will be downloaded:
2025-05-07T20:31:10.2783114Z
2025-05-07T20:31:10.2783230Z     package                    |            build
2025-05-07T20:31:10.2783540Z     ---------------------------|-----------------
2025-05-07T20:31:10.2783902Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:31:10.2784455Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:31:10.2785088Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:31:10.2785517Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:31:10.2786113Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:31:10.2786691Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:31:10.2787236Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:31:10.2788127Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:31:10.2788507Z     ------------------------------------------------------------
2025-05-07T20:31:10.2788839Z                                            Total:        428 KB
2025-05-07T20:31:10.2789039Z
2025-05-07T20:31:10.2789160Z The following NEW packages will be INSTALLED:
2025-05-07T20:31:10.2789375Z
2025-05-07T20:31:10.2789565Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:31:10.2790057Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:31:10.2790580Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:31:10.2791033Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:31:10.2791481Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:31:10.2791934Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:31:10.2792353Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:31:10.2792752Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:31:10.2793007Z
2025-05-07T20:31:10.2793153Z Downloading and Extracting Packages: ...working...
2025-05-07T20:31:10.6782744Z tomli-2.2.1 | 19 KB | ########## | 100%
2025-05-07T20:31:10.6805106Z pluggy-1.5.0 | 23 KB | ########## | 100%
2025-05-07T20:31:10.6999885Z expecttest-0.3.0 | 14 KB | ########## | 100%
2025-05-07T20:31:10.7153401Z iniconfig-2.0.0 | 11 KB | ########## | 100%
2025-05-07T20:31:10.7476124Z colorama-0.4.6 | 26 KB | ########## | 100%
2025-05-07T20:31:10.7549935Z packaging-25.0 | 61 KB | ########## | 100%
2025-05-07T20:31:10.7643098Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%
2025-05-07T20:31:10.7650942Z pytest-8.3.5 | 254 KB | ########## | 100%
2025-05-07T20:31:10.7655033Z done
2025-05-07T20:31:10.8665002Z Preparing transaction: done
2025-05-07T20:31:10.9670318Z Verifying transaction: done
2025-05-07T20:31:12.8697087Z Executing transaction: done
2025-05-07T20:31:12.9974241Z [TEST] Checking imports ...
2025-05-07T20:31:16.9416544Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
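(Aside: the "[CHECK] Python (sub-)package ... found" probes above and below amount to import checks; a minimal sketch of one, using only the standard library, is shown here. The package_found helper is illustrative, not the script's actual implementation.)

    import importlib.util

    def package_found(name: str) -> bool:
        # Locate the (sub-)package on the import path without fully importing
        # it; for a dotted name, parent packages are imported as a side effect.
        return importlib.util.find_spec(name) is not None

    print(package_found("fbgemm_gpu"))         # True once the wheel is installed
    print(package_found("fbgemm_gpu.config"))  # sub-packages are probed the same way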
2025-05-07T20:31:16.9429589Z [TEST] Setting feature flags ... 2025-05-07T20:31:16.9430007Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:31:16.9430693Z 2025-05-07T20:31:17.3638730Z 2025-05-07T20:31:17.3639333Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:31:17.3641359Z ################################################################################ 2025-05-07T20:31:17.3641673Z # Run FBGEMM-GPU Tests: 2025-05-07T20:31:17.3641905Z # 2025-05-07T20:31:17.3661272Z # [2025-05-07T20:31:17.365Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:31:17.3661678Z ################################################################################ 2025-05-07T20:31:17.3661886Z 2025-05-07T20:31:17.3669203Z [TEST] Enumerating ALL test files ... 2025-05-07T20:31:17.3698234Z ./attention/gqa_test.py 2025-05-07T20:31:17.3698499Z ./coalesce/coalesce_test.py 2025-05-07T20:31:17.3698766Z ./comm/multi_gpu_car_test.py 2025-05-07T20:31:17.3699027Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:17.3699315Z ./kv_cache/kv_cache_test.py 2025-05-07T20:31:17.3699564Z ./moe/activation_test.py 2025-05-07T20:31:17.3699818Z ./moe/gather_scatter_test.py 2025-05-07T20:31:17.3700062Z ./moe/layers_test.py 2025-05-07T20:31:17.3700288Z ./moe/shuffling_test.py 2025-05-07T20:31:17.3700521Z ./quantize/quantize_test.py 2025-05-07T20:31:17.3700683Z 2025-05-07T20:31:17.3700794Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:31:17.3701002Z 2025-05-07T20:31:17.3719424Z ################################################################################ 2025-05-07T20:31:17.3735456Z # [2025-05-07T20:31:17.373Z] Run Python Test Suite: 2025-05-07T20:31:17.3735782Z # ./attention/gqa_test.py 2025-05-07T20:31:17.3736058Z ################################################################################ 2025-05-07T20:31:17.3761379Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:31:17.3761995Z 2025-05-07T20:31:19.9113672Z ============================= test session starts ============================== 2025-05-07T20:31:19.9114562Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:19.9115399Z cachedir: .pytest_cache 2025-05-07T20:31:19.9115971Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:19.9116676Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:19.9117096Z plugins: hypothesis-6.131.14 2025-05-07T20:31:21.5097541Z collecting ... 
collected 2 items 2025-05-07T20:31:21.5097835Z 2025-05-07T20:31:59.0796049Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:59.0796649Z self=, 2025-05-07T20:31:59.0797066Z int4_kv=False, 2025-05-07T20:31:59.0797319Z num_groups=1, 2025-05-07T20:31:59.0797566Z B=1, 2025-05-07T20:31:59.0799308Z MAX_T=4, 2025-05-07T20:31:59.0799592Z N_H_L=1, 2025-05-07T20:31:59.0799849Z ) 2025-05-07T20:31:59.0800085Z Trying example: test_gqa( 2025-05-07T20:31:59.0800444Z self=, 2025-05-07T20:31:59.0800838Z int4_kv=True, 2025-05-07T20:31:59.0801082Z num_groups=1, 2025-05-07T20:31:59.0801326Z B=1, 2025-05-07T20:31:59.0801546Z MAX_T=4, 2025-05-07T20:31:59.0801767Z N_H_L=1, 2025-05-07T20:31:59.0801991Z ) 2025-05-07T20:31:59.0802217Z Trying example: test_gqa( 2025-05-07T20:31:59.0802556Z self=, 2025-05-07T20:31:59.0802930Z int4_kv=True, 2025-05-07T20:31:59.0803174Z num_groups=4, 2025-05-07T20:31:59.0803411Z B=23, 2025-05-07T20:31:59.0803643Z MAX_T=33, 2025-05-07T20:31:59.0803875Z N_H_L=68, 2025-05-07T20:31:59.0804093Z ) 2025-05-07T20:31:59.0804327Z Trying example: test_gqa( 2025-05-07T20:31:59.0804668Z self=, 2025-05-07T20:31:59.0805030Z int4_kv=True, 2025-05-07T20:31:59.0805281Z num_groups=4, 2025-05-07T20:31:59.0805931Z B=77, 2025-05-07T20:31:59.0806147Z MAX_T=4, 2025-05-07T20:31:59.0806378Z N_H_L=1, 2025-05-07T20:31:59.0806604Z ) 2025-05-07T20:31:59.0806836Z Trying example: test_gqa( 2025-05-07T20:31:59.0807184Z self=, 2025-05-07T20:31:59.0807555Z int4_kv=True, 2025-05-07T20:31:59.0807793Z num_groups=4, 2025-05-07T20:31:59.0808038Z B=77, 2025-05-07T20:31:59.0808259Z MAX_T=52, 2025-05-07T20:31:59.0808484Z N_H_L=67, 2025-05-07T20:31:59.0808709Z ) 2025-05-07T20:31:59.0808935Z Trying example: test_gqa( 2025-05-07T20:31:59.0809275Z self=, 2025-05-07T20:31:59.0809646Z int4_kv=False, 2025-05-07T20:31:59.0809893Z num_groups=4, 2025-05-07T20:31:59.0810136Z B=57, 2025-05-07T20:31:59.0810351Z MAX_T=45, 2025-05-07T20:31:59.0810585Z N_H_L=120, 2025-05-07T20:31:59.0810823Z ) 2025-05-07T20:31:59.0811049Z Trying example: test_gqa( 2025-05-07T20:31:59.0811393Z self=, 2025-05-07T20:31:59.0811778Z int4_kv=True, 2025-05-07T20:31:59.0812027Z num_groups=4, 2025-05-07T20:31:59.0812278Z B=52, 2025-05-07T20:31:59.0812509Z MAX_T=42, 2025-05-07T20:31:59.0812733Z N_H_L=53, 2025-05-07T20:31:59.0813026Z ) 2025-05-07T20:31:59.0813261Z Trying example: test_gqa( 2025-05-07T20:31:59.0813601Z self=, 2025-05-07T20:31:59.0813974Z int4_kv=True, 2025-05-07T20:31:59.0814232Z num_groups=1, 2025-05-07T20:31:59.0814473Z B=77, 2025-05-07T20:31:59.0814695Z MAX_T=95, 2025-05-07T20:31:59.0814921Z N_H_L=53, 2025-05-07T20:31:59.0815140Z ) 2025-05-07T20:31:59.0815367Z Trying example: test_gqa( 2025-05-07T20:31:59.0815710Z self=, 2025-05-07T20:31:59.0816077Z int4_kv=True, 2025-05-07T20:31:59.0816318Z num_groups=4, 2025-05-07T20:31:59.0816560Z B=113, 2025-05-07T20:31:59.0816784Z MAX_T=48, 2025-05-07T20:31:59.0817013Z N_H_L=96, 2025-05-07T20:31:59.0817290Z ) 2025-05-07T20:31:59.0817526Z Trying example: test_gqa( 2025-05-07T20:31:59.0817862Z self=, 2025-05-07T20:31:59.0818443Z int4_kv=False, 2025-05-07T20:31:59.0818700Z num_groups=1, 2025-05-07T20:31:59.0818937Z B=51, 2025-05-07T20:31:59.0819162Z MAX_T=61, 2025-05-07T20:31:59.0819397Z N_H_L=69, 2025-05-07T20:31:59.0819621Z ) 2025-05-07T20:31:59.0819856Z Trying example: test_gqa( 2025-05-07T20:31:59.0820200Z self=, 2025-05-07T20:31:59.0820566Z int4_kv=False, 2025-05-07T20:31:59.0820822Z num_groups=4, 2025-05-07T20:31:59.0821066Z B=17, 2025-05-07T20:31:59.0821280Z MAX_T=113, 
2025-05-07T20:31:59.0821516Z N_H_L=65, 2025-05-07T20:31:59.0821740Z ) 2025-05-07T20:31:59.0821957Z Trying example: test_gqa( 2025-05-07T20:31:59.0822299Z self=, 2025-05-07T20:31:59.0822671Z int4_kv=False, 2025-05-07T20:31:59.0822918Z num_groups=4, 2025-05-07T20:31:59.0823165Z B=17, 2025-05-07T20:31:59.0823386Z MAX_T=65, 2025-05-07T20:31:59.0823618Z N_H_L=65, 2025-05-07T20:31:59.0823853Z ) 2025-05-07T20:31:59.0824092Z Trying example: test_gqa( 2025-05-07T20:31:59.0824428Z self=, 2025-05-07T20:31:59.0824810Z int4_kv=False, 2025-05-07T20:31:59.0825066Z num_groups=4, 2025-05-07T20:31:59.0825319Z B=65, 2025-05-07T20:31:59.0825533Z MAX_T=65, 2025-05-07T20:31:59.0825762Z N_H_L=65, 2025-05-07T20:31:59.0825986Z ) 2025-05-07T20:31:59.0826202Z Trying example: test_gqa( 2025-05-07T20:31:59.0826539Z self=, 2025-05-07T20:31:59.0826912Z int4_kv=False, 2025-05-07T20:31:59.0827152Z num_groups=1, 2025-05-07T20:31:59.0827402Z B=6, 2025-05-07T20:31:59.0827745Z MAX_T=108, 2025-05-07T20:31:59.0827970Z N_H_L=14, 2025-05-07T20:31:59.0828194Z ) 2025-05-07T20:31:59.0828422Z Trying example: test_gqa( 2025-05-07T20:31:59.0828855Z self=, 2025-05-07T20:31:59.0829230Z int4_kv=False, 2025-05-07T20:31:59.0829481Z num_groups=1, 2025-05-07T20:31:59.0829721Z B=6, 2025-05-07T20:31:59.0829946Z MAX_T=14, 2025-05-07T20:31:59.0830182Z N_H_L=14, 2025-05-07T20:31:59.0830400Z ) 2025-05-07T20:31:59.0830628Z Trying example: test_gqa( 2025-05-07T20:31:59.0830974Z self=, 2025-05-07T20:31:59.0831340Z int4_kv=False, 2025-05-07T20:31:59.0831598Z num_groups=1, 2025-05-07T20:31:59.0831842Z B=6, 2025-05-07T20:31:59.0832057Z MAX_T=6, 2025-05-07T20:31:59.0832291Z N_H_L=14, 2025-05-07T20:31:59.0832520Z ) 2025-05-07T20:31:59.0832743Z Trying example: test_gqa( 2025-05-07T20:31:59.0833084Z self=, 2025-05-07T20:31:59.0833453Z int4_kv=False, 2025-05-07T20:31:59.0833704Z num_groups=1, 2025-05-07T20:31:59.0833943Z B=6, 2025-05-07T20:31:59.0834158Z MAX_T=6, 2025-05-07T20:31:59.0834397Z N_H_L=6, 2025-05-07T20:31:59.0834611Z ) 2025-05-07T20:31:59.0834836Z Trying example: test_gqa( 2025-05-07T20:31:59.0835183Z self=, 2025-05-07T20:31:59.0835549Z int4_kv=False, 2025-05-07T20:31:59.0835796Z num_groups=1, 2025-05-07T20:31:59.0836038Z B=70, 2025-05-07T20:31:59.0836255Z MAX_T=94, 2025-05-07T20:31:59.0836490Z N_H_L=78, 2025-05-07T20:31:59.0836720Z ) 2025-05-07T20:31:59.0836943Z Trying example: test_gqa( 2025-05-07T20:31:59.0837289Z self=, 2025-05-07T20:31:59.0837662Z int4_kv=False, 2025-05-07T20:31:59.0837901Z num_groups=1, 2025-05-07T20:31:59.0838147Z B=78, 2025-05-07T20:31:59.0838378Z MAX_T=94, 2025-05-07T20:31:59.0838600Z N_H_L=78, 2025-05-07T20:31:59.0838827Z ) 2025-05-07T20:31:59.0839056Z Trying example: test_gqa( 2025-05-07T20:31:59.0839391Z self=, 2025-05-07T20:31:59.0839770Z int4_kv=False, 2025-05-07T20:31:59.0840031Z num_groups=1, 2025-05-07T20:31:59.0840533Z B=94, 2025-05-07T20:31:59.0840754Z MAX_T=94, 2025-05-07T20:31:59.0840983Z N_H_L=78, 2025-05-07T20:31:59.0841355Z ) 2025-05-07T20:31:59.0841586Z Trying example: test_gqa( 2025-05-07T20:31:59.0841924Z self=, 2025-05-07T20:31:59.0842290Z int4_kv=False, 2025-05-07T20:31:59.0842541Z num_groups=1, 2025-05-07T20:31:59.0842778Z B=94, 2025-05-07T20:31:59.0843017Z MAX_T=94, 2025-05-07T20:31:59.0843237Z N_H_L=94, 2025-05-07T20:31:59.0843610Z ) 2025-05-07T20:31:59.0844030Z Trying example: test_gqa( 2025-05-07T20:31:59.0844471Z self=, 2025-05-07T20:31:59.0844950Z int4_kv=False, 2025-05-07T20:31:59.0854367Z num_groups=4, 2025-05-07T20:31:59.0854600Z B=41, 2025-05-07T20:31:59.0854784Z MAX_T=105, 
2025-05-07T20:31:59.0854991Z N_H_L=126, 2025-05-07T20:31:59.0855189Z ) 2025-05-07T20:31:59.0855381Z Trying example: test_gqa( 2025-05-07T20:31:59.0855684Z self=, 2025-05-07T20:31:59.0855995Z int4_kv=False, 2025-05-07T20:31:59.0856197Z num_groups=4, 2025-05-07T20:31:59.0856401Z B=105, 2025-05-07T20:31:59.0856585Z MAX_T=105, 2025-05-07T20:31:59.0856782Z N_H_L=126, 2025-05-07T20:31:59.0856967Z ) 2025-05-07T20:31:59.0857157Z Trying example: test_gqa( 2025-05-07T20:31:59.0857447Z self=, 2025-05-07T20:31:59.0857750Z int4_kv=False, 2025-05-07T20:31:59.0857958Z num_groups=4, 2025-05-07T20:31:59.0858159Z B=105, 2025-05-07T20:31:59.0858339Z MAX_T=105, 2025-05-07T20:31:59.0858533Z N_H_L=105, 2025-05-07T20:31:59.0858723Z ) 2025-05-07T20:31:59.0858909Z Trying example: test_gqa( 2025-05-07T20:31:59.0859187Z self=, 2025-05-07T20:31:59.0859492Z int4_kv=True, 2025-05-07T20:31:59.0859698Z num_groups=1, 2025-05-07T20:31:59.0859894Z B=95, 2025-05-07T20:31:59.0860261Z MAX_T=114, 2025-05-07T20:31:59.0860457Z N_H_L=43, 2025-05-07T20:31:59.0860640Z ) 2025-05-07T20:31:59.0860828Z Trying example: test_gqa( 2025-05-07T20:31:59.0861118Z self=, 2025-05-07T20:31:59.0861417Z int4_kv=True, 2025-05-07T20:31:59.0861625Z num_groups=1, 2025-05-07T20:31:59.0861836Z B=43, 2025-05-07T20:31:59.0862015Z MAX_T=114, 2025-05-07T20:31:59.0862208Z N_H_L=43, 2025-05-07T20:31:59.0862392Z ) 2025-05-07T20:31:59.0862574Z Trying example: test_gqa( 2025-05-07T20:31:59.0862851Z self=, 2025-05-07T20:31:59.0863148Z int4_kv=True, 2025-05-07T20:31:59.0863349Z num_groups=1, 2025-05-07T20:31:59.0863555Z B=43, 2025-05-07T20:31:59.0863741Z MAX_T=43, 2025-05-07T20:31:59.0863925Z N_H_L=43, 2025-05-07T20:31:59.0864116Z ) 2025-05-07T20:31:59.0864306Z Trying example: test_gqa( 2025-05-07T20:31:59.0864587Z self=, 2025-05-07T20:31:59.0864903Z int4_kv=False, 2025-05-07T20:31:59.0865118Z num_groups=1, 2025-05-07T20:31:59.0865321Z B=21, 2025-05-07T20:31:59.0865513Z MAX_T=38, 2025-05-07T20:31:59.0865707Z N_H_L=42, 2025-05-07T20:31:59.0865891Z ) 2025-05-07T20:31:59.0866088Z Trying example: test_gqa( 2025-05-07T20:31:59.0866376Z self=, 2025-05-07T20:31:59.0866687Z int4_kv=False, 2025-05-07T20:31:59.0866894Z num_groups=1, 2025-05-07T20:31:59.0867109Z B=38, 2025-05-07T20:31:59.0867297Z MAX_T=38, 2025-05-07T20:31:59.0867588Z N_H_L=42, 2025-05-07T20:31:59.0867776Z ) 2025-05-07T20:31:59.0867965Z Trying example: test_gqa( 2025-05-07T20:31:59.0868244Z self=, 2025-05-07T20:31:59.0868551Z int4_kv=False, 2025-05-07T20:31:59.0868768Z num_groups=1, 2025-05-07T20:31:59.0868968Z B=38, 2025-05-07T20:31:59.0869148Z MAX_T=42, 2025-05-07T20:31:59.0869340Z N_H_L=42, 2025-05-07T20:31:59.0869520Z ) 2025-05-07T20:31:59.0869715Z Trying example: test_gqa( 2025-05-07T20:31:59.0869998Z self=, 2025-05-07T20:31:59.0870298Z int4_kv=False, 2025-05-07T20:31:59.0870602Z num_groups=1, 2025-05-07T20:31:59.0870807Z B=42, 2025-05-07T20:31:59.0870983Z MAX_T=42, 2025-05-07T20:31:59.0871173Z N_H_L=42, 2025-05-07T20:31:59.0871362Z ) 2025-05-07T20:31:59.0871543Z Trying example: test_gqa( 2025-05-07T20:31:59.0871829Z self=, 2025-05-07T20:31:59.0872128Z int4_kv=True, 2025-05-07T20:31:59.0872332Z num_groups=1, 2025-05-07T20:31:59.0872534Z B=74, 2025-05-07T20:31:59.0872714Z MAX_T=20, 2025-05-07T20:31:59.0872897Z N_H_L=15, 2025-05-07T20:31:59.0873088Z ) 2025-05-07T20:31:59.0873276Z Trying example: test_gqa( 2025-05-07T20:31:59.0873559Z self=, 2025-05-07T20:31:59.0873853Z int4_kv=True, 2025-05-07T20:31:59.0874063Z num_groups=1, 2025-05-07T20:31:59.0874275Z B=20, 2025-05-07T20:31:59.0874458Z MAX_T=20, 
2025-05-07T20:31:59.0874655Z N_H_L=15, 2025-05-07T20:31:59.0874847Z ) 2025-05-07T20:31:59.0875034Z Trying example: test_gqa( 2025-05-07T20:31:59.0875328Z self=, 2025-05-07T20:31:59.0875634Z int4_kv=True, 2025-05-07T20:31:59.0875842Z num_groups=1, 2025-05-07T20:31:59.0876053Z B=20, 2025-05-07T20:31:59.0876246Z MAX_T=15, 2025-05-07T20:31:59.0876434Z N_H_L=15, 2025-05-07T20:31:59.0876629Z ) 2025-05-07T20:31:59.0876823Z Trying example: test_gqa( 2025-05-07T20:31:59.0877104Z self=, 2025-05-07T20:31:59.0877410Z int4_kv=True, 2025-05-07T20:31:59.0877622Z num_groups=1, 2025-05-07T20:31:59.0877825Z B=15, 2025-05-07T20:31:59.0878015Z MAX_T=20, 2025-05-07T20:31:59.0878218Z N_H_L=15, 2025-05-07T20:31:59.0878410Z ) 2025-05-07T20:31:59.0878605Z Trying example: test_gqa( 2025-05-07T20:31:59.0878894Z self=, 2025-05-07T20:31:59.0879290Z int4_kv=True, 2025-05-07T20:31:59.0879504Z num_groups=1, 2025-05-07T20:31:59.0879709Z B=15, 2025-05-07T20:31:59.0879900Z MAX_T=15, 2025-05-07T20:31:59.0880094Z N_H_L=15, 2025-05-07T20:31:59.0880282Z ) 2025-05-07T20:31:59.0880462Z Trying example: test_gqa( 2025-05-07T20:31:59.0880746Z self=, 2025-05-07T20:31:59.0881050Z int4_kv=False, 2025-05-07T20:31:59.0881257Z num_groups=4, 2025-05-07T20:31:59.0881456Z B=117, 2025-05-07T20:31:59.0881642Z MAX_T=104, 2025-05-07T20:31:59.0881836Z N_H_L=69, 2025-05-07T20:31:59.0882018Z ) 2025-05-07T20:31:59.0882207Z Trying example: test_gqa( 2025-05-07T20:31:59.0882495Z self=, 2025-05-07T20:31:59.0882796Z int4_kv=False, 2025-05-07T20:31:59.0883006Z num_groups=4, 2025-05-07T20:31:59.0883211Z B=117, 2025-05-07T20:31:59.0883392Z MAX_T=117, 2025-05-07T20:31:59.0883587Z N_H_L=69, 2025-05-07T20:31:59.0883783Z ) 2025-05-07T20:31:59.0883969Z Trying example: test_gqa( 2025-05-07T20:31:59.0884259Z self=, 2025-05-07T20:31:59.0884575Z int4_kv=False, 2025-05-07T20:31:59.0884775Z num_groups=4, 2025-05-07T20:31:59.0884984Z B=69, 2025-05-07T20:31:59.0885177Z MAX_T=117, 2025-05-07T20:31:59.0885368Z N_H_L=69, 2025-05-07T20:31:59.0885559Z ) 2025-05-07T20:31:59.0885751Z Trying example: test_gqa( 2025-05-07T20:31:59.0886034Z self=, 2025-05-07T20:31:59.0886347Z int4_kv=False, 2025-05-07T20:31:59.0886568Z num_groups=4, 2025-05-07T20:31:59.0886772Z B=117, 2025-05-07T20:31:59.0886971Z MAX_T=69, 2025-05-07T20:31:59.0887171Z N_H_L=69, 2025-05-07T20:31:59.0887356Z ) 2025-05-07T20:31:59.0887541Z PASSED 2025-05-07T20:31:59.0978026Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
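(Aside: the "Trying example: test_gqa(...)" lines above are emitted by Hypothesis, which drives the test through the parameter combinations shown, using the 'ci' profile reported at session start. Such a profile would typically be registered in a conftest.py along the lines of the sketch below; this is an illustration, not FBGEMM's actual configuration code.)

    from hypothesis import HealthCheck, settings

    # Register and activate a "ci" profile with the settings seen in the log.
    settings.register_profile(
        "ci",
        database=None,                                  # do not persist failing examples
        deadline=None,                                  # no per-example time limit
        print_blob=True,                                # print a reproduction blob on failure
        derandomize=True,                               # deterministic example generation
        suppress_health_check=(HealthCheck.too_slow,),  # tolerate slow GPU examples
    )
    settings.load_profile("ci")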
2025-05-07T20:31:59.0978345Z 2025-05-07T20:31:59.0978495Z =========================== short test summary info ============================ 2025-05-07T20:31:59.0979363Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:59.0980052Z ======================== 1 passed, 1 skipped in 39.70s ========================= 2025-05-07T20:31:59.7762105Z 2025-05-07T20:31:59.7762620Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:59.7782578Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:31:59.7782868Z 2025-05-07T20:31:59.7782872Z 2025-05-07T20:31:59.7782876Z 2025-05-07T20:31:59.7782880Z 2025-05-07T20:31:59.7803072Z ################################################################################ 2025-05-07T20:31:59.7821220Z # [2025-05-07T20:31:59.781Z] Run Python Test Suite: 2025-05-07T20:31:59.7821542Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:59.7821816Z ################################################################################ 2025-05-07T20:31:59.7846101Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:59.7846749Z 2025-05-07T20:32:01.9387073Z ============================= test session starts ============================== 2025-05-07T20:32:01.9388353Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:01.9389375Z cachedir: .pytest_cache 2025-05-07T20:32:01.9390485Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:01.9391887Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:01.9392670Z plugins: hypothesis-6.131.14 2025-05-07T20:32:03.5096172Z collecting ... 
collected 1 item 2025-05-07T20:32:03.5096390Z 2025-05-07T20:32:04.2624502Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:32:04.2625121Z 2025-05-07T20:32:04.2625265Z ============================== 1 passed in 2.46s =============================== 2025-05-07T20:32:04.9274943Z 2025-05-07T20:32:04.9275661Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:32:04.9292464Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:32:04.9292771Z 2025-05-07T20:32:04.9292776Z 2025-05-07T20:32:04.9292792Z 2025-05-07T20:32:04.9292796Z 2025-05-07T20:32:04.9312804Z ################################################################################ 2025-05-07T20:32:04.9327976Z # [2025-05-07T20:32:04.932Z] Run Python Test Suite: 2025-05-07T20:32:04.9328298Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.9328586Z ################################################################################ 2025-05-07T20:32:04.9353291Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.9353934Z 2025-05-07T20:32:07.0900231Z ============================= test session starts ============================== 2025-05-07T20:32:07.0900903Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:07.0901442Z cachedir: .pytest_cache 2025-05-07T20:32:07.0901997Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:07.0902811Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:07.0903216Z plugins: hypothesis-6.131.14 2025-05-07T20:32:08.7064014Z collecting ... 
collected 5 items 2025-05-07T20:32:08.7064441Z 2025-05-07T20:32:08.7074109Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:32:08.7081798Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:32:08.7088391Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:32:08.7099324Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:32:08.7113738Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:32:08.7114074Z 2025-05-07T20:32:08.7114223Z =========================== short test summary info ============================ 2025-05-07T20:32:08.7114893Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7115807Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7116706Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7117609Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7118519Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7119153Z ============================== 5 skipped in 1.75s ============================== 2025-05-07T20:32:09.3044034Z 2025-05-07T20:32:09.3044488Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:32:09.3063389Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:32:09.3063674Z 2025-05-07T20:32:09.3063679Z 2025-05-07T20:32:09.3063684Z 2025-05-07T20:32:09.3063687Z 2025-05-07T20:32:09.3084975Z ################################################################################ 2025-05-07T20:32:09.3102331Z # [2025-05-07T20:32:09.309Z] Run Python Test Suite: 2025-05-07T20:32:09.3102668Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:09.3103527Z ################################################################################ 2025-05-07T20:32:09.3127628Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:09.3128533Z 2025-05-07T20:32:11.4666614Z ============================= test session starts ============================== 2025-05-07T20:32:11.4667967Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.4669004Z cachedir: .pytest_cache 2025-05-07T20:32:11.4669837Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.4670584Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.4670982Z plugins: hypothesis-6.131.14 2025-05-07T20:32:13.1249571Z collecting ... 
collected 2 items 2025-05-07T20:32:13.1249799Z 2025-05-07T20:32:13.1259609Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:32:13.1274160Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:32:13.1274609Z 2025-05-07T20:32:13.1274757Z =========================== short test summary info ============================ 2025-05-07T20:32:13.1275396Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:13.1276231Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:13.1276817Z ============================== 2 skipped in 1.79s ============================== 2025-05-07T20:32:13.7408108Z 2025-05-07T20:32:13.7409019Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:13.7428842Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:32:13.7429198Z 2025-05-07T20:32:13.7429202Z 2025-05-07T20:32:13.7429206Z 2025-05-07T20:32:13.7429581Z 2025-05-07T20:32:13.7451457Z ################################################################################ 2025-05-07T20:32:13.7466943Z # [2025-05-07T20:32:13.746Z] Run Python Test Suite: 2025-05-07T20:32:13.7467271Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.7467653Z ################################################################################ 2025-05-07T20:32:13.7492455Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.7493305Z 2025-05-07T20:32:15.9139037Z ============================= test session starts ============================== 2025-05-07T20:32:15.9139671Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:15.9140581Z cachedir: .pytest_cache 2025-05-07T20:32:15.9141236Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:15.9141986Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:15.9142385Z plugins: hypothesis-6.131.14 2025-05-07T20:32:17.5197629Z collecting ... collected 4 items 2025-05-07T20:32:17.5197932Z 2025-05-07T20:32:20.0071914Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:32:20.0153068Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:32:20.0243794Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:32:20.0328558Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:32:20.0329050Z 2025-05-07T20:32:20.0329254Z =========================== short test summary info ============================ 2025-05-07T20:32:20.0330398Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:32:20.0331332Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:32:20.0331946Z ============================== 4 skipped in 4.25s ============================== 2025-05-07T20:32:22.1256559Z 2025-05-07T20:32:22.1257372Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:32:22.1276326Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:32:22.1276610Z 2025-05-07T20:32:22.1276615Z 2025-05-07T20:32:22.1276619Z 2025-05-07T20:32:22.1276622Z 2025-05-07T20:32:22.1296965Z ################################################################################ 2025-05-07T20:32:22.1311953Z # [2025-05-07T20:32:22.130Z] Run Python Test Suite: 2025-05-07T20:32:22.1312407Z # ./moe/activation_test.py 2025-05-07T20:32:22.1312805Z ################################################################################ 2025-05-07T20:32:22.1338364Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:32:22.1338989Z 2025-05-07T20:32:24.2955570Z ============================= test session starts ============================== 2025-05-07T20:32:24.2956530Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:24.2957417Z cachedir: .pytest_cache 2025-05-07T20:32:24.2958394Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:24.2959565Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:24.2960240Z plugins: hypothesis-6.131.14 2025-05-07T20:32:25.8988336Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.9950017Z collecting ... 
collected 2 items 2025-05-07T20:32:25.9950220Z 2025-05-07T20:32:31.0647525Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:31.0648148Z self=, 2025-05-07T20:32:31.0648529Z T=1, 2025-05-07T20:32:31.0648719Z D=5120, 2025-05-07T20:32:31.0648914Z contiguous=True, 2025-05-07T20:32:31.0649131Z compiled=True, 2025-05-07T20:32:31.0649338Z ) 2025-05-07T20:32:31.0649532Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0649897Z self=, 2025-05-07T20:32:31.0650276Z T=4096, 2025-05-07T20:32:31.0650465Z D=5120, 2025-05-07T20:32:31.0650648Z contiguous=True, 2025-05-07T20:32:31.0650867Z compiled=True, 2025-05-07T20:32:31.0651067Z ) 2025-05-07T20:32:31.0651261Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0651641Z self=, 2025-05-07T20:32:31.0652018Z T=4096, 2025-05-07T20:32:31.0652199Z D=7168, 2025-05-07T20:32:31.0652393Z contiguous=False, 2025-05-07T20:32:31.0652613Z compiled=False, 2025-05-07T20:32:31.0652809Z ) 2025-05-07T20:32:31.0653002Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0653369Z self=, 2025-05-07T20:32:31.0653737Z T=4096, 2025-05-07T20:32:31.0655688Z D=5120, 2025-05-07T20:32:31.0655891Z contiguous=False, 2025-05-07T20:32:31.0656113Z compiled=True, 2025-05-07T20:32:31.0656306Z ) 2025-05-07T20:32:31.0656499Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0656869Z self=, 2025-05-07T20:32:31.0657248Z T=1, 2025-05-07T20:32:31.0657427Z D=7168, 2025-05-07T20:32:31.0657615Z contiguous=True, 2025-05-07T20:32:31.0658011Z compiled=True, 2025-05-07T20:32:31.0658208Z ) 2025-05-07T20:32:31.0658399Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0658766Z self=, 2025-05-07T20:32:31.0659132Z T=1, 2025-05-07T20:32:31.0659310Z D=7168, 2025-05-07T20:32:31.0659492Z contiguous=False, 2025-05-07T20:32:31.0659713Z compiled=True, 2025-05-07T20:32:31.0659909Z ) 2025-05-07T20:32:31.0660096Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0660454Z self=, 2025-05-07T20:32:31.0660822Z T=4096, 2025-05-07T20:32:31.0660999Z D=5120, 2025-05-07T20:32:31.0661183Z contiguous=False, 2025-05-07T20:32:31.0661401Z compiled=False, 2025-05-07T20:32:31.0661599Z ) 2025-05-07T20:32:31.0661781Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0662146Z self=, 2025-05-07T20:32:31.0662511Z T=1, 2025-05-07T20:32:31.0662689Z D=7168, 2025-05-07T20:32:31.0662871Z contiguous=True, 2025-05-07T20:32:31.0663090Z compiled=False, 2025-05-07T20:32:31.0663282Z ) 2025-05-07T20:32:31.0663475Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0663837Z self=, 2025-05-07T20:32:31.0664204Z T=2048, 2025-05-07T20:32:31.0664383Z D=5120, 2025-05-07T20:32:31.0664569Z contiguous=True, 2025-05-07T20:32:31.0664780Z compiled=True, 2025-05-07T20:32:31.0664978Z ) 2025-05-07T20:32:31.0665165Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0665523Z self=, 2025-05-07T20:32:31.0665887Z T=2048, 2025-05-07T20:32:31.0666067Z D=7168, 2025-05-07T20:32:31.0666249Z contiguous=True, 2025-05-07T20:32:31.0666459Z compiled=True, 2025-05-07T20:32:31.0666656Z ) 2025-05-07T20:32:31.0666842Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0667197Z self=, 2025-05-07T20:32:31.0667657Z T=2048, 2025-05-07T20:32:31.0667841Z D=7168, 2025-05-07T20:32:31.0668021Z contiguous=True, 2025-05-07T20:32:31.0668743Z compiled=False, 2025-05-07T20:32:31.0668963Z ) 2025-05-07T20:32:31.0669146Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0669512Z self=, 2025-05-07T20:32:31.0669892Z T=128, 2025-05-07T20:32:31.0670068Z D=5120, 2025-05-07T20:32:31.0670260Z contiguous=False, 2025-05-07T20:32:31.0670615Z 
compiled=True, 2025-05-07T20:32:31.0670940Z ) 2025-05-07T20:32:31.0671215Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0671666Z self=, 2025-05-07T20:32:31.0680203Z T=128, 2025-05-07T20:32:31.0680404Z D=5120, 2025-05-07T20:32:31.0680589Z contiguous=True, 2025-05-07T20:32:31.0680815Z compiled=True, 2025-05-07T20:32:31.0681020Z ) 2025-05-07T20:32:31.0681218Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0681593Z self=, 2025-05-07T20:32:31.0681980Z T=16384, 2025-05-07T20:32:31.0682177Z D=5120, 2025-05-07T20:32:31.0682365Z contiguous=False, 2025-05-07T20:32:31.0682584Z compiled=True, 2025-05-07T20:32:31.0682786Z ) 2025-05-07T20:32:31.0682972Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0683345Z self=, 2025-05-07T20:32:31.0683715Z T=16384, 2025-05-07T20:32:31.0683901Z D=5120, 2025-05-07T20:32:31.0684092Z contiguous=False, 2025-05-07T20:32:31.0684315Z compiled=False, 2025-05-07T20:32:31.0684513Z ) 2025-05-07T20:32:31.0684708Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0685071Z self=, 2025-05-07T20:32:31.0685428Z T=128, 2025-05-07T20:32:31.0685612Z D=7168, 2025-05-07T20:32:31.0685798Z contiguous=True, 2025-05-07T20:32:31.0686121Z compiled=False, 2025-05-07T20:32:31.0686320Z ) 2025-05-07T20:32:31.0686504Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0686863Z self=, 2025-05-07T20:32:31.0687225Z T=128, 2025-05-07T20:32:31.0687407Z D=7168, 2025-05-07T20:32:31.0687590Z contiguous=False, 2025-05-07T20:32:31.0687812Z compiled=False, 2025-05-07T20:32:31.0688014Z ) 2025-05-07T20:32:31.0688192Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0688554Z self=, 2025-05-07T20:32:31.0688921Z T=1, 2025-05-07T20:32:31.0689094Z D=5120, 2025-05-07T20:32:31.0689275Z contiguous=False, 2025-05-07T20:32:31.0689498Z compiled=False, 2025-05-07T20:32:31.0689697Z ) 2025-05-07T20:32:31.0689878Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0690244Z self=, 2025-05-07T20:32:31.0690609Z T=1, 2025-05-07T20:32:31.0690787Z D=7168, 2025-05-07T20:32:31.0690978Z contiguous=False, 2025-05-07T20:32:31.0691196Z compiled=False, 2025-05-07T20:32:31.0691388Z ) 2025-05-07T20:32:31.0691577Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0691936Z self=, 2025-05-07T20:32:31.0692296Z T=4096, 2025-05-07T20:32:31.0692478Z D=5120, 2025-05-07T20:32:31.0692666Z contiguous=True, 2025-05-07T20:32:31.0692877Z compiled=False, 2025-05-07T20:32:31.0693075Z ) 2025-05-07T20:32:31.0693264Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0693622Z self=, 2025-05-07T20:32:31.0693990Z T=128, 2025-05-07T20:32:31.0694170Z D=7168, 2025-05-07T20:32:31.0694356Z contiguous=True, 2025-05-07T20:32:31.0694566Z compiled=True, 2025-05-07T20:32:31.0694763Z ) 2025-05-07T20:32:31.0694949Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0695304Z self=, 2025-05-07T20:32:31.0695676Z T=1, 2025-05-07T20:32:31.0695857Z D=5120, 2025-05-07T20:32:31.0696035Z contiguous=False, 2025-05-07T20:32:31.0696357Z compiled=True, 2025-05-07T20:32:31.0696555Z ) 2025-05-07T20:32:31.0696737Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0697091Z self=, 2025-05-07T20:32:31.0697466Z T=4096, 2025-05-07T20:32:31.0697641Z D=7168, 2025-05-07T20:32:31.0697823Z contiguous=True, 2025-05-07T20:32:31.0698037Z compiled=False, 2025-05-07T20:32:31.0698230Z ) 2025-05-07T20:32:31.0698416Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0698776Z self=, 2025-05-07T20:32:31.0699134Z T=4096, 2025-05-07T20:32:31.0699312Z D=7168, 2025-05-07T20:32:31.0699495Z contiguous=False, 2025-05-07T20:32:31.0699707Z compiled=True, 2025-05-07T20:32:31.0699911Z ) 
2025-05-07T20:32:31.0700106Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0700469Z self=, 2025-05-07T20:32:31.0700841Z T=128, 2025-05-07T20:32:31.0701021Z D=5120, 2025-05-07T20:32:31.0701208Z contiguous=True, 2025-05-07T20:32:31.0701416Z compiled=False, 2025-05-07T20:32:31.0701615Z ) 2025-05-07T20:32:31.0701802Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0702153Z self=, 2025-05-07T20:32:31.0702516Z T=128, 2025-05-07T20:32:31.0702691Z D=5120, 2025-05-07T20:32:31.0702873Z contiguous=False, 2025-05-07T20:32:31.0703093Z compiled=False, 2025-05-07T20:32:31.0703287Z ) 2025-05-07T20:32:31.0703477Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0703836Z self=, 2025-05-07T20:32:31.0704206Z T=1, 2025-05-07T20:32:31.0704379Z D=5120, 2025-05-07T20:32:31.0704563Z contiguous=True, 2025-05-07T20:32:31.0704869Z compiled=False, 2025-05-07T20:32:31.0705062Z ) 2025-05-07T20:32:31.0705256Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0705624Z self=, 2025-05-07T20:32:31.0705983Z T=2048, 2025-05-07T20:32:31.0706162Z D=7168, 2025-05-07T20:32:31.0706349Z contiguous=False, 2025-05-07T20:32:31.0706570Z compiled=True, 2025-05-07T20:32:31.0706765Z ) 2025-05-07T20:32:31.0706955Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0707318Z self=, 2025-05-07T20:32:31.0707760Z T=2048, 2025-05-07T20:32:31.0707941Z D=7168, 2025-05-07T20:32:31.0708124Z contiguous=False, 2025-05-07T20:32:31.0708339Z compiled=False, 2025-05-07T20:32:31.0708537Z ) 2025-05-07T20:32:31.0708723Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0709078Z self=, 2025-05-07T20:32:31.0709448Z T=16384, 2025-05-07T20:32:31.0709640Z D=7168, 2025-05-07T20:32:31.0709819Z contiguous=False, 2025-05-07T20:32:31.0710038Z compiled=True, 2025-05-07T20:32:31.0710237Z ) 2025-05-07T20:32:31.0710425Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0710785Z self=, 2025-05-07T20:32:31.0711150Z T=16384, 2025-05-07T20:32:31.0711335Z D=7168, 2025-05-07T20:32:31.0711522Z contiguous=True, 2025-05-07T20:32:31.0711740Z compiled=True, 2025-05-07T20:32:31.0711930Z ) 2025-05-07T20:32:31.0712124Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0712483Z self=, 2025-05-07T20:32:31.0712849Z T=4096, 2025-05-07T20:32:31.0713019Z D=7168, 2025-05-07T20:32:31.0713204Z contiguous=True, 2025-05-07T20:32:31.0713419Z compiled=True, 2025-05-07T20:32:31.0713608Z ) 2025-05-07T20:32:31.0713793Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0714151Z self=, 2025-05-07T20:32:31.0714511Z T=2048, 2025-05-07T20:32:31.0714692Z D=5120, 2025-05-07T20:32:31.0714877Z contiguous=False, 2025-05-07T20:32:31.0715178Z compiled=False, 2025-05-07T20:32:31.0715386Z ) 2025-05-07T20:32:31.0715576Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0715930Z self=, 2025-05-07T20:32:31.0716297Z T=2048, 2025-05-07T20:32:31.0716479Z D=5120, 2025-05-07T20:32:31.0716658Z contiguous=True, 2025-05-07T20:32:31.0716874Z compiled=False, 2025-05-07T20:32:31.0717074Z ) 2025-05-07T20:32:31.0717256Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0717621Z self=, 2025-05-07T20:32:31.0717989Z T=128, 2025-05-07T20:32:31.0718171Z D=7168, 2025-05-07T20:32:31.0718354Z contiguous=False, 2025-05-07T20:32:31.0718574Z compiled=True, 2025-05-07T20:32:31.0718768Z ) 2025-05-07T20:32:31.0718954Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0719313Z self=, 2025-05-07T20:32:31.0719677Z T=16384, 2025-05-07T20:32:31.0719860Z D=5120, 2025-05-07T20:32:31.0720050Z contiguous=True, 2025-05-07T20:32:31.0720263Z compiled=True, 2025-05-07T20:32:31.0720455Z ) 2025-05-07T20:32:31.0720645Z Trying example: 
2025-05-07T20:32:31.0722327Z Trying example: test_silu_mul(self=, T=16384, D=5120, contiguous=True, compiled=False)
2025-05-07T20:32:31.0724101Z Trying example: test_silu_mul(self=, T=16384, D=7168, contiguous=False, compiled=False)
2025-05-07T20:32:31.0725883Z Trying example: test_silu_mul(self=, T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:32:31.0727561Z PASSED
2025-05-07T20:32:31.1329204Z W0507 20:32:31.130000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.1483142Z W0507 20:32:31.146000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical CompilationError traceback omitted; the same warning repeats at 20:32:31.185000 and 20:32:31.190000 ...]
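The repeated ValueError is an architecture mismatch, not a kernel bug: Triton's fp8e4nv type corresponds to e4m3, which its NVIDIA backend only accepts on newer GPUs, while the A10G on this g5.4xlarge runner reports compute capability (8, 6). A minimal sketch of the kind of capability guard that avoids compiling such kernels on unsupported parts; the helper name and the 8.9 threshold are assumptions for illustration, not code from this repo:

    import torch

    def fp8_e4m3_supported() -> bool:
        # Assumed threshold: Triton's fp8e4nv (e4m3) path targets sm_89+
        # (Ada) and sm_90 (Hopper); the A10G here is sm_86.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)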
2025-05-07T20:32:31.5990021Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea25d4e0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
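Here even the reference path dies, because triton_quantize_fp8_row is itself a Triton kernel that emits fp8e4nv. For orientation, a rough pure-PyTorch sketch of the rowwise fp8 quantization such a reference computes; the semantics (per-row max, the 448 e4m3 ceiling, scale_ub as an upper clamp) are assumptions inferred from the test, not FBGEMM's actual kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, optionally clamped from above by scale_ub.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Scale such that y ~= y_fp8.to(float32) * scale[:, None],
        # matching the dequantization in the test body above.
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale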
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:31.6037940Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [... test body identical to the listing above ...]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea28e160>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
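Both paths, the _fbgemm_silu_mul_quant kernel under test and the _kernel_quantize_fp8_row reference, fail at Triton compile time for the same environmental reason, so the failures say nothing about the kernels' logic. A hedged sketch of how such hypothesis tests are commonly gated on hardware support; the skipIf placement and the helper reuse the assumed guard from the earlier sketch and are not this repo's actual mechanism:

    import unittest

    import torch

    def fp8_e4m3_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(
            not fp8_e4m3_supported(),
            "fp8e4nv (e4m3) kernels require sm_89+; skipping on this GPU",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as in the log above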
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.8665898Z W0507 20:32:31.862000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical CompilationError traceback omitted; the same warning repeats at 20:32:31.933000, 20:32:32.141000, and 20:32:32.151000 ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.9367235Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:31.9368578Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:31.9369893Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:31.9371393Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:31.9372355Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.9373644Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:31.9375004Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.9376297Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:31.9377644Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.9378682Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:31.9380064Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:31.9381290Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:31.9382114Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:31.9383296Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:31.9384483Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:31.9385510Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:31.9386510Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 
2025-05-07T20:32:31.9387792Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:31.9389053Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:31.9389941Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:31.9391098Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:31.9392117Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:31.9392865Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:31.9394013Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:31.9395338Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:31.9396388Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.9397274Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.9398009Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:31.9399005Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1453450Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:32.1454506Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:32.1456013Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:32.1457413Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:32.1458377Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1459658Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:32.1461027Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1462313Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:32.1463652Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1464675Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:32.1465915Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:32.1467292Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:32.1468172Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:32.1469359Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:32.1470543Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:32.1471552Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:32.1472572Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     self.visit(item)
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     ~~~~~~~~~~^^^^^^
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant(
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
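
Every repetition of this warning bottoms out in the same root cause: Triton refuses to emit the fp8e4nv (float8_e4m3fn) type on this GPU. The job runs on a g5.4xlarge runner, whose A10G is compute capability 8.6, while fp8e4nv conversions in this Triton version require capability 8.9 or newer (Ada/Hopper). A minimal sketch of the capability probe that accounts for the failure follows; the helper name is ours, not part of the test suite:

    # Hypothetical capability probe (not in the logged suite): fp8e4nv needs
    # SM 8.9+ on NVIDIA GPUs, and the A10G on this runner is SM 8.6.
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    print(supports_fp8e4nv())  # False on an A10G (SM 8.6); True on L4 (8.9) or H100 (9.0)
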

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea1be660>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
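
Note that the reference path fails the same way as the kernel under test: triton_quantize_fp8_row itself launches a Triton kernel that casts to fp8e4nv, so on this GPU there is no FP8 path to fall back to. A hedged sketch, assuming a stock unittest setup, of how such cases could be skipped on pre-SM 8.9 hardware instead of hard-failing (the decorator placement and helper name are our assumptions, not the suite's actual code):

    # Sketch only: gate FP8 Triton tests on device capability so older GPUs
    # report a skip rather than a CompilationError.
    import unittest
    import torch

    def _fp8e4nv_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTest(unittest.TestCase):
        @unittest.skipUnless(_fp8e4nv_supported(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
        def test_silu_mul_quant(self) -> None:
            ...  # body as listed above
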

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a1080>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]                        module_map=module_map)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     generator.visit(fn.parse())
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ret = super().visit(node)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return visitor(node)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     self.visit(item)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ~~~~~~~~~~^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant(
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
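
The [1/1], [1/2], [1/3] markers index successive torch.compile attempts on the same frame; each attempt re-runs identify_mutated_tensors, which compiles the kernel to TTIR purely for mutation analysis and, when that fails, conservatively assumes every input is mutated (correct, but it blocks some optimizations). The underlying Triton error reproduces without FBGEMM at all; a standalone sketch (ours, under the same SM 8.6 assumption) that raises the same CompilationError:

    # Standalone repro sketch: casting to tl.float8e4nv fails to compile on
    # pre-SM 8.9 GPUs with the same "type fp8e4nv not supported" ValueError.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    _cast_fp8[(4,)](x, y, x.numel(), BLOCK=256)  # CompilationError on an A10G
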

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a3740>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
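
The reference path recomputes SiLU-mul in fp32 and then row-quantizes with triton_quantize_fp8_row, so it needs the same fp8e4nv support as the fused kernel. A Triton-free stand-in for the row-wise quantization, sketched to match the dequantization the test applies (y_fp8.to(torch.float32) * y_scale[:, None]); the name, eps, and clamping choices are our assumptions, not fp8_gemm.py's actual logic:

    # Sketch of a pure-PyTorch row-wise FP8 quantizer (a stand-in for
    # triton_quantize_fp8_row, not its actual implementation).
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(x, scale_ub=None):
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale

Unlike the Triton kernel, the final cast to torch.float8_e4m3fn here is a plain data conversion, so it runs on any device PyTorch supports.
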

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = 
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527100>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
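
For reference, the computation the fused _fbgemm_silu_mul_quant kernel performs is exactly what the test's ref_fn spells out: SiLU(x0) * x1 in fp32, followed by row-wise FP8 quantization. An eager-mode sketch, reusing the quantize_fp8_row_ref stand-in above:

    # Eager-mode sketch of the fused op (ours): SiLU gating in fp32, then
    # row-wise FP8 quantization via the stand-in sketched earlier.
    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        x0f = x0.to(torch.float32)
        y = x0f * torch.sigmoid(x0f) * x1.to(torch.float32)
        return quantize_fp8_row_ref(y, scale_ub)
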
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.5624049Z 2025-05-07T20:32:33.5632896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.8325021Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:33.8326211Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:33.8327544Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:33.8328937Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:33.8329909Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:33.8331220Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:33.8332789Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.8334125Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:33.8335527Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.8336570Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:32:33.8337819Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:33.8339045Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:33.8339873Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:33.8341240Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:33.8342427Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:33.8343442Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:33.8344576Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:33.8345771Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:33.8347024Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:33.8347955Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:33.8349016Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:33.8350040Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:33.8350788Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:33.8351951Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:33.8353277Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:33.8354301Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.8355311Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.8356033Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:33.8357026Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
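The repeated CompilationError above has a single root cause: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G in this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. A minimal guard such FP8 tests could carry is sketched below; supports_fp8_e4m3 and requires_fp8_e4m3 are hypothetical helper names, not FBGEMM or Triton APIs:

    # Sketch of a capability guard, assuming torch is the only dependency.
    # supports_fp8_e4m3 / requires_fp8_e4m3 are hypothetical helper names.
    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv only on SM 8.9 (Ada) and SM 9.0 (Hopper) or
        # newer; the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8_e4m3 = unittest.skipUnless(
        supports_fp8_e4m3(), "Triton fp8e4nv needs compute capability >= 8.9"
    )

Applied as a decorator on a test like test_silu_mul_quant, this would turn the failures below into skips on pre-Ada GPUs.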
2025-05-07T20:32:35.4399659Z 
2025-05-07T20:32:35.4400587Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.4401749Z self=,
2025-05-07T20:32:35.4402168Z T=4096,
2025-05-07T20:32:35.4402358Z D=7168,
2025-05-07T20:32:35.4402557Z scale_ub=None,
2025-05-07T20:32:35.4402769Z contiguous=False,
2025-05-07T20:32:35.4402996Z compiled=False,
2025-05-07T20:32:35.4403207Z )
2025-05-07T20:32:35.4403517Z self = 
2025-05-07T20:32:35.4404004Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:35.4404287Z 
2025-05-07T20:32:35.4404365Z @given(
2025-05-07T20:32:35.4404600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.4404913Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.4405227Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.4405612Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.4405936Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.4406231Z )
2025-05-07T20:32:35.4406588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.4407039Z def test_silu_mul_quant(
2025-05-07T20:32:35.4407290Z self,
2025-05-07T20:32:35.4407483Z T: int,
2025-05-07T20:32:35.4407678Z D: int,
2025-05-07T20:32:35.4407904Z scale_ub: Optional[float],
2025-05-07T20:32:35.4408173Z contiguous: bool,
2025-05-07T20:32:35.4408411Z compiled: bool,
2025-05-07T20:32:35.4408632Z ) -> None:
2025-05-07T20:32:35.4408850Z torch.manual_seed(2025)
2025-05-07T20:32:35.4409094Z 
2025-05-07T20:32:35.4409359Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.4409704Z 
2025-05-07T20:32:35.4409895Z x_sign = torch.sign(x)
2025-05-07T20:32:35.4410177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.4410487Z x = x_sign * x_clamp
2025-05-07T20:32:35.4410732Z x0 = x[:, :D]
2025-05-07T20:32:35.4410944Z x1 = x[:, D:]
2025-05-07T20:32:35.4411150Z 
2025-05-07T20:32:35.4411514Z if contiguous:
2025-05-07T20:32:35.4411922Z x0 = x0.contiguous()
2025-05-07T20:32:35.4412189Z x1 = x1.contiguous()
2025-05-07T20:32:35.4412420Z 
2025-05-07T20:32:35.4412599Z if scale_ub is not None:
2025-05-07T20:32:35.4412869Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:35.4413207Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:35.4413527Z )
2025-05-07T20:32:35.4413721Z else:
2025-05-07T20:32:35.4413938Z scale_ub_tensor = None
2025-05-07T20:32:35.4414188Z 
2025-05-07T20:32:35.4414421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.4414737Z op = silu_mul_quant
2025-05-07T20:32:35.4414978Z if compiled:
2025-05-07T20:32:35.4415225Z op = torch.compile(op)
2025-05-07T20:32:35.4415525Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.4415800Z 
2025-05-07T20:32:35.4415990Z > y_fp8, y_scale = fn()
2025-05-07T20:32:35.4416158Z 
2025-05-07T20:32:35.4416265Z moe/activation_test.py:117:
2025-05-07T20:32:35.4416560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.4416889Z moe/activation_test.py:115: in fn
2025-05-07T20:32:35.4417172Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.4417866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.4418544Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.4419085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.4419757Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.4420413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.4421020Z kernel = self.compile( 2025-05-07T20:32:35.4421577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.4422238Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.4422629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.4422852Z 2025-05-07T20:32:35.4423056Z self = 2025-05-07T20:32:35.4424132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.4425525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0526f20>} 2025-05-07T20:32:35.4426898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.4427957Z context = 2025-05-07T20:32:35.4428244Z 2025-05-07T20:32:35.4428409Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.4428934Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.4429396Z module_map=module_map) 2025-05-07T20:32:35.4429752Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.4430103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.4430359Z E ^ 2025-05-07T20:32:35.4430809Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.4431260Z 2025-05-07T20:32:35.4431763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.4432272Z 2025-05-07T20:32:35.4432374Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.4432780Z self=, 2025-05-07T20:32:35.4433184Z T=128, 2025-05-07T20:32:35.4433372Z D=7168, 2025-05-07T20:32:35.4433562Z scale_ub=None, 2025-05-07T20:32:35.4433769Z contiguous=False, 2025-05-07T20:32:35.4433996Z compiled=True, 2025-05-07T20:32:35.4434203Z ) 2025-05-07T20:32:35.4434514Z self = 2025-05-07T20:32:35.4434995Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.4435265Z 2025-05-07T20:32:35.4435346Z @given( 2025-05-07T20:32:35.4435580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.4435889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.4436202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.4436537Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.4436854Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.4437137Z ) 2025-05-07T20:32:35.4437482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.4437927Z def test_silu_mul_quant( 2025-05-07T20:32:35.4438170Z self, 2025-05-07T20:32:35.4438366Z T: int, 2025-05-07T20:32:35.4438556Z D: int, 2025-05-07T20:32:35.4438775Z scale_ub: Optional[float], 2025-05-07T20:32:35.4439050Z contiguous: bool, 2025-05-07T20:32:35.4439295Z compiled: bool, 2025-05-07T20:32:35.4439510Z ) -> None: 2025-05-07T20:32:35.4439729Z torch.manual_seed(2025) 2025-05-07T20:32:35.4439972Z 2025-05-07T20:32:35.4440515Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.4440860Z 2025-05-07T20:32:35.4441048Z x_sign = torch.sign(x) 2025-05-07T20:32:35.4441338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.4441644Z x = x_sign * x_clamp 2025-05-07T20:32:35.4441879Z x0 = x[:, :D] 2025-05-07T20:32:35.4442086Z x1 = x[:, D:] 2025-05-07T20:32:35.4442291Z 2025-05-07T20:32:35.4442472Z if contiguous: 2025-05-07T20:32:35.4442694Z x0 = x0.contiguous() 2025-05-07T20:32:35.4442948Z x1 = x1.contiguous() 2025-05-07T20:32:35.4443189Z 2025-05-07T20:32:35.4443373Z if scale_ub is not None: 2025-05-07T20:32:35.4443646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.4443976Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.4444278Z ) 2025-05-07T20:32:35.4444466Z else: 2025-05-07T20:32:35.4444672Z scale_ub_tensor = None 2025-05-07T20:32:35.4444923Z 2025-05-07T20:32:35.4445149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.4445461Z op = silu_mul_quant 2025-05-07T20:32:35.4445718Z if compiled: 2025-05-07T20:32:35.4445954Z op = torch.compile(op) 2025-05-07T20:32:35.4446244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.4446514Z 2025-05-07T20:32:35.4446696Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.4446976Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.4447261Z 2025-05-07T20:32:35.4447487Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.4447819Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.4448109Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.4448416Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.4448765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.4449084Z 2025-05-07T20:32:35.4449282Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.4449471Z 2025-05-07T20:32:35.4449568Z moe/activation_test.py:126: 2025-05-07T20:32:35.4449990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.4450325Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.4450643Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.4451428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.4452167Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.4452713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.4453382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.4454074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.4454793Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.4455532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.4456161Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.4456768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.4457282Z fn() 2025-05-07T20:32:35.4457941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.4458530Z self.fn.run( 2025-05-07T20:32:35.4459007Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.4459524Z kernel = self.compile( 2025-05-07T20:32:35.4460050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.4460827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.4461219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.4461440Z 2025-05-07T20:32:35.4461641Z self = 2025-05-07T20:32:35.4462701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.4464051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527e20>} 2025-05-07T20:32:35.4465368Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.4466383Z context = 2025-05-07T20:32:35.4466664Z 2025-05-07T20:32:35.4466831Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.4467345Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.4467876Z module_map=module_map) 2025-05-07T20:32:35.4468241Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.4468586Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.4468853Z E ^ 2025-05-07T20:32:35.4469308Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.4469898Z 2025-05-07T20:32:35.4470316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6872207Z 2025-05-07T20:32:35.6872401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6874378Z self=, 2025-05-07T20:32:35.6875522Z T=128, 2025-05-07T20:32:35.6875789Z D=7168, 2025-05-07T20:32:35.6876053Z scale_ub=None, 2025-05-07T20:32:35.6876364Z contiguous=False, 2025-05-07T20:32:35.6876679Z compiled=False, 2025-05-07T20:32:35.6876936Z ) 2025-05-07T20:32:35.6877258Z self = 2025-05-07T20:32:35.6877754Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.6878036Z 2025-05-07T20:32:35.6878126Z @given( 2025-05-07T20:32:35.6878351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6878668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6878972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6879295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6879619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6879907Z ) 2025-05-07T20:32:35.6880259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6880713Z def test_silu_mul_quant( 2025-05-07T20:32:35.6880959Z self, 2025-05-07T20:32:35.6881155Z T: int, 2025-05-07T20:32:35.6881345Z D: int, 2025-05-07T20:32:35.6881566Z scale_ub: Optional[float], 2025-05-07T20:32:35.6881848Z contiguous: bool, 2025-05-07T20:32:35.6882078Z compiled: bool, 2025-05-07T20:32:35.6882297Z ) -> None: 2025-05-07T20:32:35.6882510Z torch.manual_seed(2025) 2025-05-07T20:32:35.6882744Z 2025-05-07T20:32:35.6883018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6883357Z 2025-05-07T20:32:35.6883543Z x_sign = torch.sign(x) 
2025-05-07T20:32:35.6883979Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6884294Z x = x_sign * x_clamp 2025-05-07T20:32:35.6884534Z x0 = x[:, :D] 2025-05-07T20:32:35.6884744Z x1 = x[:, D:] 2025-05-07T20:32:35.6884949Z 2025-05-07T20:32:35.6885124Z if contiguous: 2025-05-07T20:32:35.6885353Z x0 = x0.contiguous() 2025-05-07T20:32:35.6885606Z x1 = x1.contiguous() 2025-05-07T20:32:35.6885839Z 2025-05-07T20:32:35.6886020Z if scale_ub is not None: 2025-05-07T20:32:35.6886287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6886618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6886926Z ) 2025-05-07T20:32:35.6887122Z else: 2025-05-07T20:32:35.6887328Z scale_ub_tensor = None 2025-05-07T20:32:35.6887568Z 2025-05-07T20:32:35.6887795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6888108Z op = silu_mul_quant 2025-05-07T20:32:35.6888352Z if compiled: 2025-05-07T20:32:35.6888595Z op = torch.compile(op) 2025-05-07T20:32:35.6888893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6889165Z 2025-05-07T20:32:35.6889354Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6889514Z 2025-05-07T20:32:35.6889621Z moe/activation_test.py:117: 2025-05-07T20:32:35.6889912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6890237Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6890514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6891232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6891905Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6892437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6901923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6902721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6903289Z kernel = self.compile( 2025-05-07T20:32:35.6903832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6904491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6904896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6905203Z 2025-05-07T20:32:35.6905511Z self = 2025-05-07T20:32:35.6906625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6908161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f13c73999e0>} 2025-05-07T20:32:35.6909502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6910520Z context = 2025-05-07T20:32:35.6910804Z 2025-05-07T20:32:35.6910970Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6911500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6911969Z module_map=module_map) 2025-05-07T20:32:35.6912326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6912787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6913046Z E ^ 2025-05-07T20:32:35.6913520Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6913964Z 2025-05-07T20:32:35.6914394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6914903Z 2025-05-07T20:32:35.6915003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6915412Z self=, 2025-05-07T20:32:35.6915821Z T=4096, 2025-05-07T20:32:35.6916081Z D=5120, 2025-05-07T20:32:35.6916349Z scale_ub=1200.0, 2025-05-07T20:32:35.6916664Z contiguous=True, 2025-05-07T20:32:35.6916919Z compiled=False, 2025-05-07T20:32:35.6917118Z ) 2025-05-07T20:32:35.6917430Z self = 2025-05-07T20:32:35.6917918Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6918193Z 2025-05-07T20:32:35.6918267Z @given( 2025-05-07T20:32:35.6918492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6918789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6919084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6919406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6919722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6919999Z ) 2025-05-07T20:32:35.6920344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6920773Z def test_silu_mul_quant( 2025-05-07T20:32:35.6921004Z self, 2025-05-07T20:32:35.6921195Z T: int, 2025-05-07T20:32:35.6921393Z D: int, 2025-05-07T20:32:35.6921607Z scale_ub: Optional[float], 2025-05-07T20:32:35.6921872Z contiguous: bool, 2025-05-07T20:32:35.6922115Z compiled: bool, 2025-05-07T20:32:35.6922329Z ) -> None: 2025-05-07T20:32:35.6922539Z torch.manual_seed(2025) 2025-05-07T20:32:35.6922777Z 2025-05-07T20:32:35.6923140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6923485Z 2025-05-07T20:32:35.6923669Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6923945Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6924248Z x = x_sign * x_clamp 2025-05-07T20:32:35.6924482Z x0 = x[:, :D] 2025-05-07T20:32:35.6924688Z x1 = x[:, D:] 2025-05-07T20:32:35.6924890Z 2025-05-07T20:32:35.6925070Z if contiguous: 2025-05-07T20:32:35.6925300Z x0 = x0.contiguous() 2025-05-07T20:32:35.6925550Z x1 = x1.contiguous() 2025-05-07T20:32:35.6925788Z 2025-05-07T20:32:35.6925974Z if scale_ub is not None: 2025-05-07T20:32:35.6926237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6926568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6926877Z ) 2025-05-07T20:32:35.6927062Z else: 2025-05-07T20:32:35.6927274Z scale_ub_tensor = None 2025-05-07T20:32:35.6927518Z 2025-05-07T20:32:35.6927739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6928044Z op = silu_mul_quant 2025-05-07T20:32:35.6928293Z if compiled: 
2025-05-07T20:32:35.6928527Z op = torch.compile(op) 2025-05-07T20:32:35.6928817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6929085Z 2025-05-07T20:32:35.6929264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6929435Z 2025-05-07T20:32:35.6929530Z moe/activation_test.py:117: 2025-05-07T20:32:35.6929819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6930142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6930407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6931176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6931863Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6932387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6933050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6933704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6934222Z kernel = self.compile( 2025-05-07T20:32:35.6934762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6935405Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6935841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6936073Z 2025-05-07T20:32:35.6936277Z self = 2025-05-07T20:32:35.6937338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6938696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c739a200>} 2025-05-07T20:32:35.6940008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6941335Z context = 2025-05-07T20:32:35.6941616Z 2025-05-07T20:32:35.6941781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6942302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6942911Z module_map=module_map) 2025-05-07T20:32:35.6943276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6943616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6943872Z E ^ 2025-05-07T20:32:35.6944325Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6944766Z 2025-05-07T20:32:35.6945182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6945691Z 2025-05-07T20:32:35.6945789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6946192Z self=, 2025-05-07T20:32:35.6946597Z T=1, 2025-05-07T20:32:35.6946782Z D=5120, 2025-05-07T20:32:35.6946973Z scale_ub=None, 2025-05-07T20:32:35.6947185Z contiguous=True, 2025-05-07T20:32:35.6947395Z compiled=True, 2025-05-07T20:32:35.6947656Z ) 2025-05-07T20:32:35.6947967Z self = 2025-05-07T20:32:35.6948438Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.6948691Z 2025-05-07T20:32:35.6948765Z @given( 2025-05-07T20:32:35.6948986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6949290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6949580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6949896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6950215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6950483Z ) 2025-05-07T20:32:35.6950830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6951387Z def test_silu_mul_quant( 2025-05-07T20:32:35.6951614Z self, 2025-05-07T20:32:35.6951802Z T: int, 2025-05-07T20:32:35.6951996Z D: int, 2025-05-07T20:32:35.6952206Z scale_ub: Optional[float], 2025-05-07T20:32:35.6952468Z contiguous: bool, 2025-05-07T20:32:35.6952700Z compiled: bool, 2025-05-07T20:32:35.6952906Z ) -> None: 2025-05-07T20:32:35.6953111Z torch.manual_seed(2025) 2025-05-07T20:32:35.6953346Z 2025-05-07T20:32:35.6953610Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6953939Z 2025-05-07T20:32:35.6954148Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6954429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6954729Z x = x_sign * x_clamp 2025-05-07T20:32:35.6954956Z x0 = x[:, :D] 2025-05-07T20:32:35.6955158Z x1 = x[:, D:] 2025-05-07T20:32:35.6955352Z 2025-05-07T20:32:35.6955524Z if contiguous: 2025-05-07T20:32:35.6955755Z x0 = x0.contiguous() 2025-05-07T20:32:35.6956005Z x1 = x1.contiguous() 2025-05-07T20:32:35.6956229Z 2025-05-07T20:32:35.6956415Z if scale_ub is not None: 2025-05-07T20:32:35.6956681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6957001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6957301Z ) 2025-05-07T20:32:35.6957488Z else: 2025-05-07T20:32:35.6957687Z scale_ub_tensor = None 2025-05-07T20:32:35.6957932Z 2025-05-07T20:32:35.6958160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6958463Z op = silu_mul_quant 2025-05-07T20:32:35.6958703Z if compiled: 2025-05-07T20:32:35.6958943Z op = torch.compile(op) 2025-05-07T20:32:35.6959236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6959495Z 2025-05-07T20:32:35.6959691Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.6959979Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.6960260Z 2025-05-07T20:32:35.6960489Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6960905Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.6961185Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.6961493Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.6961844Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.6962143Z 2025-05-07T20:32:35.6962338Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.6962535Z 2025-05-07T20:32:35.6962630Z moe/activation_test.py:126: 2025-05-07T20:32:35.6962917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6963235Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.6963548Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.6964315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.6965056Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.6965633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6966303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6966979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.6967680Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.6968414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.6969036Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.6969627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.6970238Z fn() 2025-05-07T20:32:35.6970742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.6971312Z self.fn.run( 2025-05-07T20:32:35.6971771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6972285Z kernel = self.compile( 2025-05-07T20:32:35.6972816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6973448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6973843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6974070Z 2025-05-07T20:32:35.6974276Z self = 2025-05-07T20:32:35.6975344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6976752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c739ac00>} 2025-05-07T20:32:35.6978101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6979128Z context = 2025-05-07T20:32:35.6979417Z 2025-05-07T20:32:35.6979580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6980102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6980562Z module_map=module_map) 2025-05-07T20:32:35.6980925Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6981359Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.6981617Z E ^ 2025-05-07T20:32:35.6982072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6982522Z 2025-05-07T20:32:35.6982935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.9220021Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:35.9222232Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:35.9224724Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:35.9226729Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:35.9227783Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.9229068Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:35.9230442Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.9231920Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.9233277Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.9234308Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:35.9235599Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:35.9236843Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:35.9237669Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:35.9238853Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:35.9240040Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:35.9241322Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:35.9242324Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:35.9243657Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:35.9244918Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:35.9245853Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:35.9246916Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:35.9247948Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:35.9248703Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:35.9249850Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:35.9251178Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:35.9252223Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.9253106Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.9253948Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:35.9254954Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
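The W0507 lines themselves are warnings, not the test failure: when torch.compile encounters a user-defined Triton kernel, identify_mutated_tensors in torch/_higher_order_ops/triton_kernel_wrap.py compiles the kernel to TTIR (the generate_ttir frame in the traceback above) to work out which tensor arguments the kernel writes to. When that compilation raises, as it does here, it falls back to treating every input as mutated, which is functionally safe but pessimistic. A rough sketch of that fallback shape, with generate_ttir_stub standing in for the real internals:

    # Rough sketch of the logged fallback; generate_ttir_stub is a stand-in
    # that fails the same way the real generate_ttir does on this runner.
    import torch

    def generate_ttir_stub(kernel, kwargs):
        raise ValueError("type fp8e4nv not supported in this architecture.")

    def identify_mutated_tensors_sketch(kernel, kwargs):
        try:
            ttir_module, tensor_names = generate_ttir_stub(kernel, kwargs)
            # The real code walks the TTIR here, looking for stores into
            # each tensor argument.
            return []
        except Exception:
            # "assuming every input is mutated": safe, but it blocks
            # optimizations that rely on knowing an input is read-only.
            return [k for k, v in kwargs.items() if isinstance(v, torch.Tensor)]

    # With the compile failing, every tensor argument counts as mutated:
    print(identify_mutated_tensors_sketch(None, {"x0": torch.empty(2), "n": 4}))
    # -> ['x0']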
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.1794313Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.1795756Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.1797091Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.1798509Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.1799582Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.1800894Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.1802272Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.1803831Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.1805304Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.1806351Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:36.1807763Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.1809005Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.1809883Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.1811334Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.1812547Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.1813590Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.1814757Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 
2025-05-07T20:32:36.1816020Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.1817292Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.1818188Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.1819375Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.1820407Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.1821169Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.1822324Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.1823689Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.1824747Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.1825679Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.1826414Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.1827421Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6031103Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.6032775Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.6034535Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.6037830Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.6038776Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6040054Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.6041772Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6043387Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.6045100Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6046382Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.6047954Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.6049344Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.6050169Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6051346Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.6052522Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.6053532Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.6054538Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.6055733Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.6056984Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.6057857Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6058917Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.6060050Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.6060801Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.6061933Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.6063253Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.6064281Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6065170Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6066051Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.6067035Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6643597Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.6644845Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.6646165Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.6647812Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.6648777Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6650067Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.6651426Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6652729Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.6654069Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6655098Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.6663936Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.6665173Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.6666214Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6667424Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.6668688Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.6669704Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.6670694Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.6671898Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.6673167Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.6674058Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6675126Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.6676150Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.6677162Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.6678617Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.6680302Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.6681604Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6682710Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6683593Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.6684854Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8518893Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.8520117Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.8521429Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.8522818Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.8523959Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.8525242Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.8526650Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8527927Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.8529281Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8530307Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.8531555Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.8532779Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.8533602Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8534904Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.8536091Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.8537112Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.8538112Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.8539307Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.8540747Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.8541632Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8542704Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.8543729Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.8544486Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.8545635Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.8547103Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.8548210Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8549118Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8549842Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.8550836Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8610984Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.8612354Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.8614015Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.8615802Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.8616804Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.8618243Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.8619592Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8620862Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.8622206Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8623232Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.8624463Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.8625686Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.8626560Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8627813Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.8628996Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.8630083Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.8631075Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.8632265Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.8633506Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.8634392Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8635492Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.8636560Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.8637305Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.8638439Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.8639762Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.8641076Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8641964Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8642677Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.8643665Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
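The repeated ValueError pins down the root cause: Triton's fp8e4nv type (torch.float8_e4m3fn) needs hardware FP8 support, which NVIDIA introduced with SM 8.9 (Ada) and SM 9.0 (Hopper), while the A10G in a g5.4xlarge reports SM 8.6 and therefore only offers fp8e4b15 and fp8e5. A capability guard would look roughly like the sketch below (illustrative only, not code from this job; pick_fp8_dtype is a hypothetical helper):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # fp8e4nv (torch.float8_e4m3fn) requires SM >= 8.9 (Ada/Hopper);
        # on older parts such as the A10G (SM 8.6), Triton only accepts
        # fp8e5 (torch.float8_e5m2) and fp8e4b15.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2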
2025-05-07T20:32:37.2475759Z 
2025-05-07T20:32:37.2476209Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.2476828Z     self=,
2025-05-07T20:32:37.2477444Z     T=2048,
2025-05-07T20:32:37.2477709Z     D=5120,
2025-05-07T20:32:37.2477967Z     scale_ub=None,
2025-05-07T20:32:37.2478244Z     contiguous=True,
2025-05-07T20:32:37.2478474Z     compiled=True,
2025-05-07T20:32:37.2478676Z )
2025-05-07T20:32:37.2479000Z self = 
2025-05-07T20:32:37.2479486Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.2479758Z 
2025-05-07T20:32:37.2479846Z     @given(
2025-05-07T20:32:37.2480070Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.2480380Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.2480693Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.2481020Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.2481352Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.2481637Z     )
2025-05-07T20:32:37.2481988Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.2482451Z     def test_silu_mul_quant(
2025-05-07T20:32:37.2482687Z         self,
2025-05-07T20:32:37.2483056Z         T: int,
2025-05-07T20:32:37.2483256Z         D: int,
2025-05-07T20:32:37.2483474Z         scale_ub: Optional[float],
2025-05-07T20:32:37.2483738Z         contiguous: bool,
2025-05-07T20:32:37.2483976Z         compiled: bool,
2025-05-07T20:32:37.2484205Z     ) -> None:
2025-05-07T20:32:37.2484412Z         torch.manual_seed(2025)
2025-05-07T20:32:37.2484652Z 
2025-05-07T20:32:37.2484920Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.2485260Z 
2025-05-07T20:32:37.2485447Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.2485739Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.2486048Z         x = x_sign * x_clamp
2025-05-07T20:32:37.2486277Z         x0 = x[:, :D]
2025-05-07T20:32:37.2486494Z         x1 = x[:, D:]
2025-05-07T20:32:37.2486707Z 
2025-05-07T20:32:37.2486886Z         if contiguous:
2025-05-07T20:32:37.2487118Z             x0 = x0.contiguous()
2025-05-07T20:32:37.2487382Z             x1 = x1.contiguous()
2025-05-07T20:32:37.2487613Z 
2025-05-07T20:32:37.2487806Z         if scale_ub is not None:
2025-05-07T20:32:37.2488082Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.2488409Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.2488723Z             )
2025-05-07T20:32:37.2488920Z         else:
2025-05-07T20:32:37.2489123Z             scale_ub_tensor = None
2025-05-07T20:32:37.2489370Z 
2025-05-07T20:32:37.2489598Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.2489911Z             op = silu_mul_quant
2025-05-07T20:32:37.2490148Z             if compiled:
2025-05-07T20:32:37.2490400Z                 op = torch.compile(op)
2025-05-07T20:32:37.2490701Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.2490963Z 
2025-05-07T20:32:37.2491276Z         y_fp8, y_scale = fn()
2025-05-07T20:32:37.2491560Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.2491843Z 
2025-05-07T20:32:37.2492079Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.2492406Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.2492686Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.2492992Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.2493351Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.2493665Z 
2025-05-07T20:32:37.2493858Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.2494058Z 
2025-05-07T20:32:37.2494157Z moe/activation_test.py:126: 
2025-05-07T20:32:37.2494450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.2494772Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.2495092Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.2495904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.2496637Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.2497178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.2497878Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.2498566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.2499269Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.2499989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.2500619Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.2501217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.2501718Z     fn()
2025-05-07T20:32:37.2502309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.2502884Z     self.fn.run(
2025-05-07T20:32:37.2503343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.2503862Z     kernel = self.compile(
2025-05-07T20:32:37.2504397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.2505038Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.2505422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.2505650Z 
2025-05-07T20:32:37.2505852Z self = 
2025-05-07T20:32:37.2506923Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.2508358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c77ca020>}
2025-05-07T20:32:37.2509667Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.2510682Z context = 
2025-05-07T20:32:37.2510964Z 
2025-05-07T20:32:37.2511133Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.2511638Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.2512183Z                            module_map=module_map)
2025-05-07T20:32:37.2512552Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.2512901Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.2513164Z E       ^
2025-05-07T20:32:37.2513618Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.2514054Z 
2025-05-07T20:32:37.2514473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
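Since the failure is an architecture limitation rather than a logic bug in the kernel, the test could be gated on device capability before hypothesis starts drawing examples. A hedged sketch using the standard unittest decorator (the class name here is hypothetical; moe/activation_test.py may organize its cases differently):

    import unittest

    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical name for the sketch
        @unittest.skipIf(
            not torch.cuda.is_available()
            or torch.cuda.get_device_capability() < (8, 9),
            "fp8e4nv (torch.float8_e4m3fn) requires SM 8.9+ (Ada/Hopper)",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the log above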
2025-05-07T20:32:37.2514976Z 
2025-05-07T20:32:37.2515075Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.2515488Z     self=,
2025-05-07T20:32:37.2515885Z     T=128,
2025-05-07T20:32:37.2516065Z     D=5120,
2025-05-07T20:32:37.2516254Z     scale_ub=None,
2025-05-07T20:32:37.2516470Z     contiguous=True,
2025-05-07T20:32:37.2516686Z     compiled=True,
2025-05-07T20:32:37.2516889Z )
[... this example fails with the same test source, traceback, and fp8e4nv CompilationError footer as the T=2048 example above; the verbatim duplicate is elided ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.2552584Z 2025-05-07T20:32:37.2553001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4833248Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.4834537Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:37.4835870Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.4837312Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.4838259Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4839542Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.4841184Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4842977Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.4844332Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4845351Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:37.4846598Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.4847818Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:37.4848652Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.4849825Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.4851000Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:37.4852018Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:37.4853015Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:37.4854341Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.4855596Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.4856483Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.4857571Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:37.4858585Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:37.4859350Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:37.4860498Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.4861823Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.4862859Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4863775Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4864497Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:37.4865568Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5446809Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.5448040Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:37.5449347Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.5450733Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.5451721Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.5452997Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.5454349Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5455620Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.5457130Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5458161Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:37.5459399Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.5460621Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:37.5461439Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.5462624Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.5463796Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:37.5464801Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:37.5465797Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 
2025-05-07T20:32:37.5467035Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.5468477Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.5469353Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.5470416Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:37.5471437Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:37.5472212Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:37.5473348Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.5474688Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.5475739Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5476682Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5477415Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:37.5478401Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... identical "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" warnings and fp8e4nv CompilationError tracebacks repeated for torch.compile frames [1/6] through [1/7]; duplicates omitted ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6877233Z 2025-05-07T20:32:38.6877713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.6878163Z self=, 2025-05-07T20:32:38.6878584Z T=4096, 2025-05-07T20:32:38.6878842Z D=5120, 2025-05-07T20:32:38.6879040Z scale_ub=None, 2025-05-07T20:32:38.6879259Z contiguous=True, 2025-05-07T20:32:38.6879492Z compiled=True, 2025-05-07T20:32:38.6879731Z ) 2025-05-07T20:32:38.6880086Z self = 2025-05-07T20:32:38.6880579Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:38.6880862Z 2025-05-07T20:32:38.6880943Z @given( 2025-05-07T20:32:38.6881181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.6881497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.6881808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.6882151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.6882484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.6882766Z ) 2025-05-07T20:32:38.6883132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.6883924Z def test_silu_mul_quant( 2025-05-07T20:32:38.6884167Z self, 2025-05-07T20:32:38.6884369Z T: int, 2025-05-07T20:32:38.6884579Z D: int, 2025-05-07T20:32:38.6884802Z scale_ub: Optional[float], 2025-05-07T20:32:38.6885085Z contiguous: bool, 2025-05-07T20:32:38.6885334Z compiled: bool, 2025-05-07T20:32:38.6885559Z ) -> None: 2025-05-07T20:32:38.6885790Z torch.manual_seed(2025) 2025-05-07T20:32:38.6886046Z 2025-05-07T20:32:38.6886350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.6886712Z 2025-05-07T20:32:38.6886914Z x_sign = torch.sign(x) 2025-05-07T20:32:38.6887210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.6887516Z x = x_sign * x_clamp 2025-05-07T20:32:38.6887760Z x0 = x[:, :D] 2025-05-07T20:32:38.6887981Z x1 = x[:, D:] 2025-05-07T20:32:38.6888184Z 2025-05-07T20:32:38.6888376Z if contiguous: 2025-05-07T20:32:38.6888621Z x0 = x0.contiguous() 2025-05-07T20:32:38.6888878Z x1 = x1.contiguous() 2025-05-07T20:32:38.6889128Z 2025-05-07T20:32:38.6889340Z if scale_ub is not None: 2025-05-07T20:32:38.6889612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.6889953Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.6890270Z ) 2025-05-07T20:32:38.6890457Z else: 2025-05-07T20:32:38.6890672Z scale_ub_tensor = None 2025-05-07T20:32:38.6890932Z 2025-05-07T20:32:38.6891162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6891481Z op = silu_mul_quant 2025-05-07T20:32:38.6891738Z if compiled: 2025-05-07T20:32:38.6891997Z op = torch.compile(op) 2025-05-07T20:32:38.6892288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.6892564Z 2025-05-07T20:32:38.6892766Z y_fp8, y_scale = fn() 2025-05-07T20:32:38.6893056Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:38.6893353Z 2025-05-07T20:32:38.6893596Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6894080Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:38.6894383Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:38.6894701Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:38.6895056Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6895370Z 2025-05-07T20:32:38.6895580Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:38.6895780Z 2025-05-07T20:32:38.6895898Z moe/activation_test.py:126: 2025-05-07T20:32:38.6896194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6896535Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:38.6896869Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6897645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:38.6898400Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:38.6898955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.6899636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.6900320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:38.6901068Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:38.6901801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:38.6902432Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:38.6903028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:38.6903710Z fn() 2025-05-07T20:32:38.6904234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:38.6904826Z self.fn.run( 2025-05-07T20:32:38.6905299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.6905827Z kernel = self.compile( 2025-05-07T20:32:38.6906378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.6907029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.6907430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6907739Z 2025-05-07T20:32:38.6907957Z self = 2025-05-07T20:32:38.6909033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.6910428Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c67eaac0>} 2025-05-07T20:32:38.6911792Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.6912811Z context = 2025-05-07T20:32:38.6913092Z 2025-05-07T20:32:38.6913268Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.6913794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.6914271Z module_map=module_map) 2025-05-07T20:32:38.6914642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.6915087Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:38.6915357Z E ^ 2025-05-07T20:32:38.6915816Z E ValueError("type fp8e4nv not supported in this architecture. 
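Both the compiled path (_fbgemm_silu_mul_quant) and the eager reference path (_kernel_quantize_fp8_row, reached through triton_quantize_fp8_row) fail the same way, so the whole test requires fp8e4nv support. One way to keep the suite green on such runners is to gate the test on the capability probe above — a sketch only, reusing the illustrative fp8_e4m3_supported helper and an illustrative class name:

import unittest

import torch

def fp8_e4m3_supported() -> bool:
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTests(unittest.TestCase):  # illustrative class name
    @unittest.skipIf(
        not fp8_e4m3_supported(),
        "fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9",
    )
    def test_silu_mul_quant(self) -> None:
        ...
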
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6916259Z 2025-05-07T20:32:38.6916692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.6917243Z 2025-05-07T20:32:38.6917348Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.6917761Z self=, 2025-05-07T20:32:38.6918172Z T=16384, 2025-05-07T20:32:38.6918360Z D=5120, 2025-05-07T20:32:38.6918558Z scale_ub=None, 2025-05-07T20:32:38.6918778Z contiguous=True, 2025-05-07T20:32:38.6919007Z compiled=True, 2025-05-07T20:32:38.6919212Z ) 2025-05-07T20:32:38.6919533Z self = 2025-05-07T20:32:38.6920032Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:38.6920314Z 2025-05-07T20:32:38.6920394Z @given( 2025-05-07T20:32:38.6920631Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.6920947Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.6921246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.6921579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.6921911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.6922190Z ) 2025-05-07T20:32:38.6922541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.6922994Z def test_silu_mul_quant( 2025-05-07T20:32:38.6923250Z self, 2025-05-07T20:32:38.6923444Z T: int, 2025-05-07T20:32:38.6923740Z D: int, 2025-05-07T20:32:38.6923967Z scale_ub: Optional[float], 2025-05-07T20:32:38.6924235Z contiguous: bool, 2025-05-07T20:32:38.6924484Z compiled: bool, 2025-05-07T20:32:38.6924708Z ) -> None: 2025-05-07T20:32:38.6924922Z torch.manual_seed(2025) 2025-05-07T20:32:38.6925170Z 2025-05-07T20:32:38.6925449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.6925780Z 2025-05-07T20:32:38.6925979Z x_sign = torch.sign(x) 2025-05-07T20:32:38.6926272Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.6926577Z x = x_sign * x_clamp 2025-05-07T20:32:38.6926818Z x0 = x[:, :D] 2025-05-07T20:32:38.6927039Z x1 = x[:, D:] 2025-05-07T20:32:38.6927242Z 2025-05-07T20:32:38.6927430Z if contiguous: 2025-05-07T20:32:38.6927664Z x0 = x0.contiguous() 2025-05-07T20:32:38.6927926Z x1 = x1.contiguous() 2025-05-07T20:32:38.6928157Z 2025-05-07T20:32:38.6928361Z if scale_ub is not None: 2025-05-07T20:32:38.6928636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.6928969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.6929280Z ) 2025-05-07T20:32:38.6929479Z else: 2025-05-07T20:32:38.6929684Z scale_ub_tensor = None 2025-05-07T20:32:38.6929936Z 2025-05-07T20:32:38.6930169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6930477Z op = silu_mul_quant 2025-05-07T20:32:38.6930732Z if compiled: 2025-05-07T20:32:38.6930984Z op = torch.compile(op) 2025-05-07T20:32:38.6931273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.6931552Z 2025-05-07T20:32:38.6931751Z y_fp8, y_scale = fn() 2025-05-07T20:32:38.6932030Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:38.6932324Z 2025-05-07T20:32:38.6932565Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6932907Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:38.6933193Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:38.6933598Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:38.6933966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6934267Z 2025-05-07T20:32:38.6934472Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:38.6934664Z 2025-05-07T20:32:38.6934773Z moe/activation_test.py:126: 2025-05-07T20:32:38.6935065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6935403Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:38.6935730Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6936516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:38.6937253Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:38.6937805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.6938490Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.6939174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:38.6939897Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:38.6940966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:38.6941603Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:38.6942200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:38.6942715Z fn() 2025-05-07T20:32:38.6943244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:38.6943999Z self.fn.run( 2025-05-07T20:32:38.6944470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.6944997Z kernel = self.compile( 2025-05-07T20:32:38.6945540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.6946193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.6946634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6946864Z 2025-05-07T20:32:38.6947067Z self = 2025-05-07T20:32:38.6948204Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.6949556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13f8b5a520>} 2025-05-07T20:32:38.6950875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.6951882Z context = 2025-05-07T20:32:38.6952163Z 2025-05-07T20:32:38.6952333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.6952858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.6953328Z module_map=module_map) 2025-05-07T20:32:38.6953691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.6954049Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:38.6954308Z E ^ 2025-05-07T20:32:38.6954917Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6955358Z 2025-05-07T20:32:38.6955782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.7174724Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:38.7176948Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:38.7178314Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:38.7179315Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:38.7180404Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:38.9312460Z 2025-05-07T20:32:38.9312821Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.9313265Z self=, 2025-05-07T20:32:38.9313711Z T=1, 2025-05-07T20:32:38.9313892Z D=5120, 2025-05-07T20:32:38.9314085Z scale_ub=1200.0, 2025-05-07T20:32:38.9324073Z contiguous=True, 2025-05-07T20:32:38.9324349Z compiled=True, 2025-05-07T20:32:38.9324555Z ) 2025-05-07T20:32:38.9324869Z self = 2025-05-07T20:32:38.9325748Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:38.9326013Z 2025-05-07T20:32:38.9326092Z @given( 2025-05-07T20:32:38.9326329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.9326643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.9326944Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.9327277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.9327612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.9327900Z ) 2025-05-07T20:32:38.9328251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.9328710Z def test_silu_mul_quant( 2025-05-07T20:32:38.9328948Z self, 2025-05-07T20:32:38.9329152Z T: int, 2025-05-07T20:32:38.9329357Z D: int, 2025-05-07T20:32:38.9329571Z scale_ub: Optional[float], 2025-05-07T20:32:38.9329847Z contiguous: bool, 2025-05-07T20:32:38.9330091Z compiled: bool, 2025-05-07T20:32:38.9330322Z ) -> None: 2025-05-07T20:32:38.9330532Z torch.manual_seed(2025) 2025-05-07T20:32:38.9330774Z 2025-05-07T20:32:38.9331056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.9331395Z 2025-05-07T20:32:38.9331589Z x_sign = torch.sign(x) 2025-05-07T20:32:38.9331882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.9332188Z x = x_sign * x_clamp 2025-05-07T20:32:38.9332420Z x0 = x[:, :D] 2025-05-07T20:32:38.9332633Z x1 = x[:, D:] 2025-05-07T20:32:38.9332845Z 2025-05-07T20:32:38.9333027Z if contiguous: 2025-05-07T20:32:38.9333257Z x0 = x0.contiguous() 2025-05-07T20:32:38.9333515Z x1 = x1.contiguous() 2025-05-07T20:32:38.9333752Z 2025-05-07T20:32:38.9333938Z if scale_ub is not None: 2025-05-07T20:32:38.9334213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.9334540Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:38.9334856Z ) 2025-05-07T20:32:38.9335053Z else: 2025-05-07T20:32:38.9335411Z scale_ub_tensor = None 2025-05-07T20:32:38.9335664Z 2025-05-07T20:32:38.9335893Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.9336200Z op = silu_mul_quant 2025-05-07T20:32:38.9336448Z if compiled: 2025-05-07T20:32:38.9336694Z op = torch.compile(op) 2025-05-07T20:32:38.9336988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.9337260Z 2025-05-07T20:32:38.9337451Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.9337615Z 2025-05-07T20:32:38.9337722Z moe/activation_test.py:117: 2025-05-07T20:32:38.9338009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9338339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.9338615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.9339176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.9339727Z return fn(*args, **kwargs) 2025-05-07T20:32:38.9340798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.9341518Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.9342059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.9342743Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.9343404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.9343926Z kernel = self.compile( 2025-05-07T20:32:38.9344493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.9345281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.9345682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9345907Z 2025-05-07T20:32:38.9346118Z self = 2025-05-07T20:32:38.9347196Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.9348652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c5d0f1a0>} 2025-05-07T20:32:38.9350007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.9351024Z context = 2025-05-07T20:32:38.9351307Z 2025-05-07T20:32:38.9351481Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.9352007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.9352484Z module_map=module_map) 2025-05-07T20:32:38.9352843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.9353201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.9353461Z E ^ 2025-05-07T20:32:38.9353919Z E ValueError("type fp8e4nv not supported in this architecture. 
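Separately from the fp8 failures, the recompile_limit warning above shows torch.compile giving up on silu_mul_quant after 8 recompiles: each sampled (T, contiguous) combination changes x0's shape or strides, a guard fails, and Dynamo recompiles until it hits the limit and falls back to eager. For property-based tests that deliberately sweep shapes and strides, two common mitigations are raising the limit or resetting Dynamo between examples — a sketch with illustrative values:

import torch
import torch._dynamo

# Option 1: permit more guard-miss recompiles (default is 8, per the warning).
torch._dynamo.config.recompile_limit = 64

# Option 2: discard compiled state between hypothesis examples so each
# example starts from a clean cache instead of exhausting one frame's limit.
torch._dynamo.reset()
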
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.9354369Z 2025-05-07T20:32:38.9354777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.9355289Z 2025-05-07T20:32:38.9355396Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.9355800Z self=, 2025-05-07T20:32:38.9356334Z T=1, 2025-05-07T20:32:38.9356516Z D=5120, 2025-05-07T20:32:38.9356714Z scale_ub=None, 2025-05-07T20:32:38.9356927Z contiguous=False, 2025-05-07T20:32:38.9357146Z compiled=True, 2025-05-07T20:32:38.9357355Z ) 2025-05-07T20:32:38.9357678Z self = 2025-05-07T20:32:38.9358154Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:38.9358423Z 2025-05-07T20:32:38.9358502Z @given( 2025-05-07T20:32:38.9358733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.9359036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.9359343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.9359668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.9359989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.9360270Z ) 2025-05-07T20:32:38.9360623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.9361068Z def test_silu_mul_quant( 2025-05-07T20:32:38.9361305Z self, 2025-05-07T20:32:38.9361497Z T: int, 2025-05-07T20:32:38.9361694Z D: int, 2025-05-07T20:32:38.9361902Z scale_ub: Optional[float], 2025-05-07T20:32:38.9362178Z contiguous: bool, 2025-05-07T20:32:38.9362416Z compiled: bool, 2025-05-07T20:32:38.9362634Z ) -> None: 2025-05-07T20:32:38.9362844Z torch.manual_seed(2025) 2025-05-07T20:32:38.9363099Z 2025-05-07T20:32:38.9363364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.9363713Z 2025-05-07T20:32:38.9363905Z x_sign = torch.sign(x) 2025-05-07T20:32:38.9364190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.9364499Z x = x_sign * x_clamp 2025-05-07T20:32:38.9364885Z x0 = x[:, :D] 2025-05-07T20:32:38.9365100Z x1 = x[:, D:] 2025-05-07T20:32:38.9365311Z 2025-05-07T20:32:38.9365500Z if contiguous: 2025-05-07T20:32:38.9365729Z x0 = x0.contiguous() 2025-05-07T20:32:38.9365978Z x1 = x1.contiguous() 2025-05-07T20:32:38.9366245Z 2025-05-07T20:32:38.9366472Z if scale_ub is not None: 2025-05-07T20:32:38.9366743Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.9367075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.9367382Z ) 2025-05-07T20:32:38.9367576Z else: 2025-05-07T20:32:38.9367787Z scale_ub_tensor = None 2025-05-07T20:32:38.9368036Z 2025-05-07T20:32:38.9368257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.9368565Z op = silu_mul_quant 2025-05-07T20:32:38.9368822Z if compiled: 2025-05-07T20:32:38.9369062Z op = torch.compile(op) 2025-05-07T20:32:38.9369361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.9369637Z 2025-05-07T20:32:38.9369823Z y_fp8, y_scale = fn() 2025-05-07T20:32:38.9370104Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:38.9370391Z 2025-05-07T20:32:38.9370626Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.9370947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:38.9371231Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:38.9371538Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:38.9371886Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.9372198Z 2025-05-07T20:32:38.9372395Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:38.9372585Z 2025-05-07T20:32:38.9372680Z moe/activation_test.py:126: 2025-05-07T20:32:38.9372977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9373310Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:38.9373632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.9374487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:38.9375231Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:38.9375768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.9376462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.9377172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:38.9377883Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:38.9378625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:38.9379255Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:38.9379870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:38.9380390Z fn() 2025-05-07T20:32:38.9380923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:38.9381510Z self.fn.run( 2025-05-07T20:32:38.9381979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.9382498Z kernel = self.compile( 2025-05-07T20:32:38.9383059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.9383706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.9384108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9384415Z 2025-05-07T20:32:38.9384647Z self = 2025-05-07T20:32:38.9385979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.9387679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c64782c0>} 2025-05-07T20:32:38.9389028Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.9390032Z context = 2025-05-07T20:32:38.9390320Z 2025-05-07T20:32:38.9390494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.9391017Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.9391481Z module_map=module_map) 2025-05-07T20:32:38.9391838Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.9392203Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:38.9392463Z E ^ 2025-05-07T20:32:38.9392930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.9393378Z 2025-05-07T20:32:38.9393805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then works through the remaining sampled parameter combinations. Each example re-prints the same test body and fails identically at Triton compile time; only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)

With compiled=True the call enters through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)); with compiled=False it goes straight to fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant. Both paths reach the same _fbgemm_silu_mul_quant[grid]( launch and the same error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
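Every failure above is the same pre-compilation type check: Triton's fp8e4nv is the FP8 E4M3 format (torch.float8_e4m3fn), which Triton's NVIDIA backend only supports on GPUs of compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an A10G at capability 8.6, which only exposes fp8e4b15 and fp8e5, so the kernel can never compile here. A minimal sketch of a capability-based skip guard for such a test, assuming a unittest-style test class (the class name and decorator placement are illustrative, not FBGEMM's actual code):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with
    # compute capability >= 8.9; the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 kernels require SM 8.9+")
class ActivationTests(unittest.TestCase):  # hypothetical name
    ...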
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

The first three of these fail in _fbgemm_silu_mul_quant exactly as above. In the last example, fn() itself returns and the failure moves to the reference path (moe/activation_test.py:126): ref_fn calls triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370), whose autotuner benchmarks _kernel_quantize_fp8_row and hits the same compile error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
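The reference path makes the shape of the computation explicit: ref_fn evaluates silu(x0) * x1 in fp32 and then row-quantizes the result to FP8. A pure-PyTorch sketch of such a row-wise quantization, consistent with the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None] (the epsilon clamp and the exact scale_ub handling are assumptions, not FBGEMM's implementation):

import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # One scale per row, chosen so the row's max magnitude maps to the FP8 max.
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    y_scale = row_max.clamp(min=1e-12) / E4M3_MAX
    y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Note that the plain .to(torch.float8_e4m3fn) cast is an ordinary PyTorch op rather than a Triton kernel, so a sketch like this would not trip the compile-time check above.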
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

Both fail at y_fp8, y_scale = fn() (moe/activation_test.py:117), again in the kernel launch at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
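Eager and compiled runs differ only by the dynamo frame in the traceback; the op under test is, in effect, the fusion of the two reference steps shown in the test body. A sketch of that composition, reusing quantize_fp8_row_ref from the sketch above (silu_mul_quant_ref is an illustrative name, not the FBGEMM API):

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor | None = None
):
    # silu(x0) * x1 in fp32, then row-wise FP8 quantization -- the math the
    # fused _fbgemm_silu_mul_quant kernel performs in a single pass.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    return quantize_fp8_row_ref(y, scale_ub)  # defined in the earlier sketch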
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9852277Z 2025-05-07T20:32:39.9852700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9853206Z 2025-05-07T20:32:39.9853319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9853721Z self=, 2025-05-07T20:32:39.9854120Z T=16384, 2025-05-07T20:32:39.9854316Z D=5120, 2025-05-07T20:32:39.9854505Z scale_ub=1200.0, 2025-05-07T20:32:39.9854738Z contiguous=False, 2025-05-07T20:32:39.9854967Z compiled=True, 2025-05-07T20:32:39.9855167Z ) 2025-05-07T20:32:39.9855496Z self = 2025-05-07T20:32:39.9856120Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.9856400Z 2025-05-07T20:32:39.9856486Z @given( 2025-05-07T20:32:39.9856723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9857044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9857358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9857683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9858014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9858305Z ) 2025-05-07T20:32:39.9858650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9859110Z def test_silu_mul_quant( 2025-05-07T20:32:39.9859356Z self, 2025-05-07T20:32:39.9859555Z T: int, 2025-05-07T20:32:39.9859757Z D: int, 2025-05-07T20:32:39.9859981Z scale_ub: Optional[float], 2025-05-07T20:32:39.9860250Z contiguous: bool, 2025-05-07T20:32:39.9860509Z compiled: bool, 2025-05-07T20:32:39.9860744Z ) -> None: 2025-05-07T20:32:39.9860973Z torch.manual_seed(2025) 2025-05-07T20:32:39.9861213Z 2025-05-07T20:32:39.9861500Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9861856Z 2025-05-07T20:32:39.9862049Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9862355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9862677Z x = x_sign * x_clamp 2025-05-07T20:32:39.9862919Z x0 = x[:, :D] 2025-05-07T20:32:39.9863147Z x1 = x[:, D:] 2025-05-07T20:32:39.9863356Z 2025-05-07T20:32:39.9863537Z if contiguous: 2025-05-07T20:32:39.9863767Z x0 = x0.contiguous() 2025-05-07T20:32:39.9864023Z x1 = x1.contiguous() 2025-05-07T20:32:39.9864262Z 2025-05-07T20:32:39.9864454Z if scale_ub is not None: 2025-05-07T20:32:39.9864734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9865066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9865374Z ) 2025-05-07T20:32:39.9865569Z else: 2025-05-07T20:32:39.9865866Z scale_ub_tensor = None 2025-05-07T20:32:39.9866115Z 2025-05-07T20:32:39.9866345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9866665Z op = silu_mul_quant 2025-05-07T20:32:39.9866913Z if compiled: 2025-05-07T20:32:39.9867163Z op = torch.compile(op) 2025-05-07T20:32:39.9867547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9867822Z 2025-05-07T20:32:39.9868018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9868184Z 2025-05-07T20:32:39.9868296Z moe/activation_test.py:117: 2025-05-07T20:32:39.9868585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9868918Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9869201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9869787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.9870586Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.9871351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.9872043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9872593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9873282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9873949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9874483Z kernel = self.compile( 2025-05-07T20:32:39.9875045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9875815Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9876219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9876463Z 2025-05-07T20:32:39.9876712Z self = 2025-05-07T20:32:39.9877783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9879145Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63c3560>} 2025-05-07T20:32:39.9880478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9881545Z context = 2025-05-07T20:32:39.9881832Z 2025-05-07T20:32:39.9882002Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9882526Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9883007Z module_map=module_map) 2025-05-07T20:32:39.9883374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9883728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9883997Z E ^ 2025-05-07T20:32:39.9884466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9884916Z 2025-05-07T20:32:39.9885334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9885856Z 2025-05-07T20:32:39.9885961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9886374Z self=, 2025-05-07T20:32:39.9886919Z T=2048, 2025-05-07T20:32:39.9887106Z D=7168, 2025-05-07T20:32:39.9887306Z scale_ub=1200.0, 2025-05-07T20:32:39.9887534Z contiguous=False, 2025-05-07T20:32:39.9887776Z compiled=True, 2025-05-07T20:32:40.1716122Z ) 2025-05-07T20:32:40.1729439Z self = 2025-05-07T20:32:40.1730200Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.1730586Z 2025-05-07T20:32:40.1730718Z @given( 2025-05-07T20:32:40.1731024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1731440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1731827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1732151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1732489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1732774Z ) 2025-05-07T20:32:40.1733118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1733570Z def test_silu_mul_quant( 2025-05-07T20:32:40.1733811Z self, 2025-05-07T20:32:40.1733995Z T: int, 2025-05-07T20:32:40.1734186Z D: int, 2025-05-07T20:32:40.1734401Z scale_ub: Optional[float], 2025-05-07T20:32:40.1734665Z contiguous: bool, 2025-05-07T20:32:40.1734895Z compiled: bool, 2025-05-07T20:32:40.1735117Z ) -> None: 2025-05-07T20:32:40.1735325Z torch.manual_seed(2025) 2025-05-07T20:32:40.1735557Z 2025-05-07T20:32:40.1735826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1736159Z 2025-05-07T20:32:40.1736340Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1736644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1737158Z x = x_sign * x_clamp 2025-05-07T20:32:40.1737391Z x0 = x[:, :D] 2025-05-07T20:32:40.1737592Z x1 = x[:, D:] 2025-05-07T20:32:40.1737794Z 2025-05-07T20:32:40.1737976Z if contiguous: 2025-05-07T20:32:40.1738198Z x0 = x0.contiguous() 2025-05-07T20:32:40.1738450Z x1 = x1.contiguous() 2025-05-07T20:32:40.1738687Z 2025-05-07T20:32:40.1738866Z if scale_ub is not None: 2025-05-07T20:32:40.1739129Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1739453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1739746Z ) 2025-05-07T20:32:40.1739934Z else: 2025-05-07T20:32:40.1740348Z scale_ub_tensor = None 2025-05-07T20:32:40.1740594Z 2025-05-07T20:32:40.1740812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1741114Z op = silu_mul_quant 2025-05-07T20:32:40.1741356Z if compiled: 2025-05-07T20:32:40.1741598Z op = torch.compile(op) 2025-05-07T20:32:40.1741893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1742153Z 2025-05-07T20:32:40.1742334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1742494Z 2025-05-07T20:32:40.1742589Z moe/activation_test.py:117: 2025-05-07T20:32:40.1742875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1743193Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1743471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1744020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.1744572Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.1745256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1745939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1746472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1747262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1747980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1748498Z kernel = self.compile( 2025-05-07T20:32:40.1749043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1749679Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1750075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1750297Z 2025-05-07T20:32:40.1750508Z self = 2025-05-07T20:32:40.1751569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1753012Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63428e0>} 2025-05-07T20:32:40.1754363Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1755360Z context = 2025-05-07T20:32:40.1755647Z 2025-05-07T20:32:40.1755816Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1756338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1756787Z module_map=module_map) 2025-05-07T20:32:40.1757272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1757623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1757870Z E ^ 2025-05-07T20:32:40.1758332Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1758774Z 2025-05-07T20:32:40.1759185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1759685Z 2025-05-07T20:32:40.1759789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.1760181Z self=, 2025-05-07T20:32:40.1760571Z T=1, 2025-05-07T20:32:40.1760747Z D=5120, 2025-05-07T20:32:40.1760926Z scale_ub=None, 2025-05-07T20:32:40.1761136Z contiguous=False, 2025-05-07T20:32:40.1761353Z compiled=False, 2025-05-07T20:32:40.1761543Z ) 2025-05-07T20:32:40.1761855Z self = 2025-05-07T20:32:40.1762340Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.1762596Z 2025-05-07T20:32:40.1762679Z @given( 2025-05-07T20:32:40.1762896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1763200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1763495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1763812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1764125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1764398Z ) 2025-05-07T20:32:40.1764738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1765215Z def test_silu_mul_quant( 2025-05-07T20:32:40.1765444Z self, 2025-05-07T20:32:40.1765632Z T: int, 2025-05-07T20:32:40.1765822Z D: int, 2025-05-07T20:32:40.1766032Z scale_ub: Optional[float], 2025-05-07T20:32:40.1766299Z contiguous: bool, 2025-05-07T20:32:40.1766532Z compiled: bool, 2025-05-07T20:32:40.1766742Z ) -> None: 2025-05-07T20:32:40.1767034Z torch.manual_seed(2025) 2025-05-07T20:32:40.1767274Z 2025-05-07T20:32:40.1767539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1767866Z 2025-05-07T20:32:40.1768056Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1768345Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1768639Z x = x_sign * x_clamp 2025-05-07T20:32:40.1768869Z x0 = x[:, :D] 2025-05-07T20:32:40.1769082Z x1 = x[:, D:] 2025-05-07T20:32:40.1769274Z 2025-05-07T20:32:40.1769457Z if contiguous: 2025-05-07T20:32:40.1769687Z x0 = x0.contiguous() 2025-05-07T20:32:40.1769939Z x1 = x1.contiguous() 2025-05-07T20:32:40.1770177Z 2025-05-07T20:32:40.1770363Z if scale_ub is not None: 2025-05-07T20:32:40.1770631Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1770956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1771251Z ) 2025-05-07T20:32:40.1771444Z else: 2025-05-07T20:32:40.1771652Z scale_ub_tensor = None 2025-05-07T20:32:40.1771895Z 2025-05-07T20:32:40.1772123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1772427Z op = silu_mul_quant 2025-05-07T20:32:40.1772674Z if compiled: 2025-05-07T20:32:40.1772916Z op = torch.compile(op) 2025-05-07T20:32:40.1773197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1773469Z 2025-05-07T20:32:40.1773655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1773814Z 2025-05-07T20:32:40.1773912Z moe/activation_test.py:117: 2025-05-07T20:32:40.1774210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1774529Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1774887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1775565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1776239Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1776812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1777481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1778131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1778649Z kernel = self.compile( 2025-05-07T20:32:40.1779182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1779816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1780202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1780422Z 2025-05-07T20:32:40.1780633Z self = 2025-05-07T20:32:40.1781685Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1783034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63434c0>} 2025-05-07T20:32:40.1784341Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1785337Z context = 2025-05-07T20:32:40.1785626Z 2025-05-07T20:32:40.1785796Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1786405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1786876Z module_map=module_map) 2025-05-07T20:32:40.1787235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1787627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1787878Z E ^ 2025-05-07T20:32:40.1788328Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1788764Z 2025-05-07T20:32:40.1789187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1789687Z 2025-05-07T20:32:40.1789786Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.1790188Z self=, 2025-05-07T20:32:40.1790585Z T=4096, 2025-05-07T20:32:40.1790773Z D=7168, 2025-05-07T20:32:40.1790956Z scale_ub=1200.0, 2025-05-07T20:32:40.1791184Z contiguous=False, 2025-05-07T20:32:40.1791402Z compiled=False, 2025-05-07T20:32:40.1791601Z ) 2025-05-07T20:32:40.1791911Z self = 2025-05-07T20:32:40.1792396Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.1792671Z 2025-05-07T20:32:40.1792751Z @given( 2025-05-07T20:32:40.1792976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1793284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1793576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1793894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1794210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1794484Z ) 2025-05-07T20:32:40.1794914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1795339Z def test_silu_mul_quant( 2025-05-07T20:32:40.1795576Z self, 2025-05-07T20:32:40.1795760Z T: int, 2025-05-07T20:32:40.1795949Z D: int, 2025-05-07T20:32:40.1796161Z scale_ub: Optional[float], 2025-05-07T20:32:40.1796419Z contiguous: bool, 2025-05-07T20:32:40.1796652Z compiled: bool, 2025-05-07T20:32:40.1796866Z ) -> None: 2025-05-07T20:32:40.1797073Z torch.manual_seed(2025) 2025-05-07T20:32:40.1797307Z 2025-05-07T20:32:40.1797574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1797906Z 2025-05-07T20:32:40.1798094Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1798377Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1798671Z x = x_sign * x_clamp 2025-05-07T20:32:40.1798904Z x0 = x[:, :D] 2025-05-07T20:32:40.1799127Z x1 = x[:, D:] 2025-05-07T20:32:40.1799321Z 2025-05-07T20:32:40.1799496Z if contiguous: 2025-05-07T20:32:40.1799717Z x0 = x0.contiguous() 2025-05-07T20:32:40.1799964Z x1 = x1.contiguous() 2025-05-07T20:32:40.1800198Z 2025-05-07T20:32:40.1800386Z if scale_ub is not None: 2025-05-07T20:32:40.1800650Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1800971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1801268Z ) 2025-05-07T20:32:40.1801458Z else: 2025-05-07T20:32:40.1801656Z scale_ub_tensor = None 2025-05-07T20:32:40.1801901Z 2025-05-07T20:32:40.1802125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1802422Z op = silu_mul_quant 2025-05-07T20:32:40.1802662Z if compiled: 2025-05-07T20:32:40.1802897Z op = torch.compile(op) 2025-05-07T20:32:40.1803175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1803448Z 2025-05-07T20:32:40.1803636Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1803792Z 2025-05-07T20:32:40.1803885Z moe/activation_test.py:117: 2025-05-07T20:32:40.1804293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1804618Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1804889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1805563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:40.1806242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1806821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1807486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1808142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1808669Z kernel = self.compile( 2025-05-07T20:32:40.1809210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1809846Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1810228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1810448Z 2025-05-07T20:32:40.1810656Z self = 2025-05-07T20:32:40.1811710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1813047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ad080>} 2025-05-07T20:32:40.1814443Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1815452Z context = 2025-05-07T20:32:40.1815734Z 2025-05-07T20:32:40.1815906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1816412Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1816927Z module_map=module_map) 2025-05-07T20:32:40.1817282Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1817635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1817884Z E ^ 2025-05-07T20:32:40.1818339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1818782Z 2025-05-07T20:32:40.1819204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3346876Z 2025-05-07T20:32:40.3347219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3347923Z self=, 2025-05-07T20:32:40.3348471Z T=16384, 2025-05-07T20:32:40.3348771Z D=7168, 2025-05-07T20:32:40.3348987Z scale_ub=None, 2025-05-07T20:32:40.3349202Z contiguous=True, 2025-05-07T20:32:40.3349417Z compiled=True, 2025-05-07T20:32:40.3349623Z ) 2025-05-07T20:32:40.3349940Z self = 2025-05-07T20:32:40.3350424Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:40.3350695Z 2025-05-07T20:32:40.3350775Z @given( 2025-05-07T20:32:40.3351003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3351321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3351627Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3352127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3352454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3352733Z ) 2025-05-07T20:32:40.3353076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3353538Z def test_silu_mul_quant( 2025-05-07T20:32:40.3353771Z self, 2025-05-07T20:32:40.3353966Z T: int, 2025-05-07T20:32:40.3354166Z D: int, 2025-05-07T20:32:40.3354380Z scale_ub: Optional[float], 2025-05-07T20:32:40.3354650Z contiguous: bool, 2025-05-07T20:32:40.3354886Z compiled: bool, 2025-05-07T20:32:40.3355105Z ) -> None: 2025-05-07T20:32:40.3355316Z torch.manual_seed(2025) 2025-05-07T20:32:40.3355559Z 2025-05-07T20:32:40.3355823Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3356167Z 2025-05-07T20:32:40.3356362Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3356652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3356953Z x = x_sign * x_clamp 2025-05-07T20:32:40.3357193Z x0 = x[:, :D] 2025-05-07T20:32:40.3357407Z x1 = x[:, D:] 2025-05-07T20:32:40.3357614Z 2025-05-07T20:32:40.3357872Z if contiguous: 2025-05-07T20:32:40.3358136Z x0 = x0.contiguous() 2025-05-07T20:32:40.3358389Z x1 = x1.contiguous() 2025-05-07T20:32:40.3358626Z 2025-05-07T20:32:40.3358817Z if scale_ub is not None: 2025-05-07T20:32:40.3359082Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3359413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3359724Z ) 2025-05-07T20:32:40.3359913Z else: 2025-05-07T20:32:40.3360126Z scale_ub_tensor = None 2025-05-07T20:32:40.3360382Z 2025-05-07T20:32:40.3360738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3361047Z op = silu_mul_quant 2025-05-07T20:32:40.3361305Z if compiled: 2025-05-07T20:32:40.3361545Z op = torch.compile(op) 2025-05-07T20:32:40.3361839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3362110Z 2025-05-07T20:32:40.3362298Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3362464Z 2025-05-07T20:32:40.3362562Z moe/activation_test.py:117: 2025-05-07T20:32:40.3362856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3363187Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3363455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3364015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3364567Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3365226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3365902Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3366623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3367306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3368102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3368644Z kernel = self.compile( 2025-05-07T20:32:40.3369188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3369830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3370212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3370443Z 2025-05-07T20:32:40.3370655Z self = 2025-05-07T20:32:40.3371814Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3373192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ae2a0>} 2025-05-07T20:32:40.3374509Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3375514Z context = 2025-05-07T20:32:40.3375805Z 2025-05-07T20:32:40.3375969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3376500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3376964Z module_map=module_map) 2025-05-07T20:32:40.3377325Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3377682Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3377936Z E ^ 2025-05-07T20:32:40.3378390Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3378833Z 2025-05-07T20:32:40.3379247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3379751Z 2025-05-07T20:32:40.3379854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3380254Z self=, 2025-05-07T20:32:40.3380650Z T=4096, 2025-05-07T20:32:40.3380836Z D=5120, 2025-05-07T20:32:40.3381108Z scale_ub=None, 2025-05-07T20:32:40.3381314Z contiguous=False, 2025-05-07T20:32:40.3381536Z compiled=True, 2025-05-07T20:32:40.3381745Z ) 2025-05-07T20:32:40.3382065Z self = 2025-05-07T20:32:40.3382556Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.3382832Z 2025-05-07T20:32:40.3382918Z @given( 2025-05-07T20:32:40.3383145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3383464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3383768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3384089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3384413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3384694Z ) 2025-05-07T20:32:40.3385045Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3385498Z def test_silu_mul_quant( 2025-05-07T20:32:40.3385738Z self, 2025-05-07T20:32:40.3385936Z T: int, 2025-05-07T20:32:40.3386124Z D: int, 2025-05-07T20:32:40.3386347Z scale_ub: Optional[float], 2025-05-07T20:32:40.3386640Z contiguous: bool, 2025-05-07T20:32:40.3386894Z compiled: bool, 2025-05-07T20:32:40.3387118Z ) -> None: 2025-05-07T20:32:40.3387332Z torch.manual_seed(2025) 2025-05-07T20:32:40.3387623Z 2025-05-07T20:32:40.3387897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3388228Z 2025-05-07T20:32:40.3388414Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3388700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3389002Z x = x_sign * x_clamp 2025-05-07T20:32:40.3389234Z x0 = x[:, :D] 2025-05-07T20:32:40.3389451Z x1 = x[:, D:] 2025-05-07T20:32:40.3389658Z 2025-05-07T20:32:40.3389847Z if contiguous: 2025-05-07T20:32:40.3390076Z x0 = x0.contiguous() 2025-05-07T20:32:40.3390331Z x1 = x1.contiguous() 2025-05-07T20:32:40.3390567Z 2025-05-07T20:32:40.3390838Z if scale_ub is not None: 2025-05-07T20:32:40.3391110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3391438Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3391735Z ) 2025-05-07T20:32:40.3391925Z else: 2025-05-07T20:32:40.3392135Z scale_ub_tensor = None 2025-05-07T20:32:40.3392381Z 2025-05-07T20:32:40.3392610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3392922Z op = silu_mul_quant 2025-05-07T20:32:40.3393167Z if compiled: 2025-05-07T20:32:40.3393410Z op = torch.compile(op) 2025-05-07T20:32:40.3393699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3393961Z 2025-05-07T20:32:40.3394162Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3394327Z 2025-05-07T20:32:40.3394427Z moe/activation_test.py:117: 2025-05-07T20:32:40.3394721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3395056Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3402447Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3403023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3403604Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3404255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3404932Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3405457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3406128Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3406795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3407438Z kernel = self.compile( 2025-05-07T20:32:40.3407981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3408618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3409007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3409238Z 2025-05-07T20:32:40.3409440Z self = 2025-05-07T20:32:40.3410513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3411864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53aefc0>} 2025-05-07T20:32:40.3413180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3414186Z context = 2025-05-07T20:32:40.3414471Z 2025-05-07T20:32:40.3414633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3415154Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3415607Z module_map=module_map) 2025-05-07T20:32:40.3415962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3416305Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3416546Z E ^ 2025-05-07T20:32:40.3416993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3417443Z 2025-05-07T20:32:40.3417947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4775493Z 2025-05-07T20:32:40.4775818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4776429Z self=, 2025-05-07T20:32:40.4777019Z T=4096, 2025-05-07T20:32:40.4777260Z D=5120, 2025-05-07T20:32:40.4777583Z scale_ub=1200.0, 2025-05-07T20:32:40.4777884Z contiguous=False, 2025-05-07T20:32:40.4778107Z compiled=False, 2025-05-07T20:32:40.4778310Z ) 2025-05-07T20:32:40.4778614Z self = 2025-05-07T20:32:40.4779108Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.4779388Z 2025-05-07T20:32:40.4779465Z @given( 2025-05-07T20:32:40.4779704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4780003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4780306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4780630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4780943Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4781224Z ) 2025-05-07T20:32:40.4781567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4781994Z def test_silu_mul_quant( 2025-05-07T20:32:40.4782229Z self, 2025-05-07T20:32:40.4782420Z T: int, 2025-05-07T20:32:40.4782612Z D: int, 2025-05-07T20:32:40.4782828Z scale_ub: Optional[float], 2025-05-07T20:32:40.4783097Z contiguous: bool, 2025-05-07T20:32:40.4783324Z compiled: bool, 2025-05-07T20:32:40.4783545Z ) -> None: 2025-05-07T20:32:40.4783755Z torch.manual_seed(2025) 2025-05-07T20:32:40.4783989Z 2025-05-07T20:32:40.4784425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4784754Z 2025-05-07T20:32:40.4784944Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4785218Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4785518Z x = x_sign * x_clamp 2025-05-07T20:32:40.4785757Z x0 = x[:, :D] 2025-05-07T20:32:40.4785962Z x1 = x[:, D:] 2025-05-07T20:32:40.4786158Z 2025-05-07T20:32:40.4786331Z if contiguous: 2025-05-07T20:32:40.4786550Z x0 = x0.contiguous() 2025-05-07T20:32:40.4786825Z x1 = x1.contiguous() 2025-05-07T20:32:40.4787080Z 2025-05-07T20:32:40.4787258Z if scale_ub is not None: 2025-05-07T20:32:40.4787622Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4787948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4788243Z ) 2025-05-07T20:32:40.4788427Z else: 2025-05-07T20:32:40.4788632Z scale_ub_tensor = None 2025-05-07T20:32:40.4788881Z 2025-05-07T20:32:40.4789102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4789409Z op = silu_mul_quant 2025-05-07T20:32:40.4789648Z if compiled: 2025-05-07T20:32:40.4789880Z op = torch.compile(op) 2025-05-07T20:32:40.4790163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4790422Z 2025-05-07T20:32:40.4790599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4790763Z 2025-05-07T20:32:40.4790857Z moe/activation_test.py:117: 2025-05-07T20:32:40.4791143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4791458Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4791723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4792411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:40.4793097Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4793619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4794413Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4795069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4795587Z kernel = self.compile( 2025-05-07T20:32:40.4796124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4796863Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4797436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4797726Z 2025-05-07T20:32:40.4797929Z self = 2025-05-07T20:32:40.4799010Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4800381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca8360>} 2025-05-07T20:32:40.4801733Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4802741Z context = 2025-05-07T20:32:40.4803025Z 2025-05-07T20:32:40.4803190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4803703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4804273Z module_map=module_map) 2025-05-07T20:32:40.4804622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4804970Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4805219Z E ^ 2025-05-07T20:32:40.4805673Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.4806109Z 2025-05-07T20:32:40.4806539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4807045Z 2025-05-07T20:32:40.4807144Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4807647Z self=, 2025-05-07T20:32:40.4808143Z T=4096, 2025-05-07T20:32:40.4808328Z D=5120, 2025-05-07T20:32:40.4808525Z scale_ub=1200.0, 2025-05-07T20:32:40.4808747Z contiguous=False, 2025-05-07T20:32:40.4808974Z compiled=True, 2025-05-07T20:32:40.4809178Z ) 2025-05-07T20:32:40.4809498Z self = 2025-05-07T20:32:40.4809978Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.4810251Z 2025-05-07T20:32:40.4810328Z @given( 2025-05-07T20:32:40.4810550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4810850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4811154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4811477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4811800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4812075Z ) 2025-05-07T20:32:40.4812421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4812876Z def test_silu_mul_quant( 2025-05-07T20:32:40.4813116Z self, 2025-05-07T20:32:40.4813311Z T: int, 2025-05-07T20:32:40.4813509Z D: int, 2025-05-07T20:32:40.4813718Z scale_ub: Optional[float], 2025-05-07T20:32:40.4813985Z contiguous: bool, 2025-05-07T20:32:40.4814313Z compiled: bool, 2025-05-07T20:32:40.4814528Z ) -> None: 2025-05-07T20:32:40.4814755Z torch.manual_seed(2025) 2025-05-07T20:32:40.4814991Z 2025-05-07T20:32:40.4815250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4815590Z 2025-05-07T20:32:40.4815781Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4816060Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4816367Z x = x_sign * x_clamp 2025-05-07T20:32:40.4816609Z x0 = x[:, :D] 2025-05-07T20:32:40.4816825Z x1 = x[:, D:] 2025-05-07T20:32:40.4817021Z 2025-05-07T20:32:40.4817203Z if contiguous: 2025-05-07T20:32:40.4817439Z x0 = x0.contiguous() 2025-05-07T20:32:40.4817686Z x1 = x1.contiguous() 2025-05-07T20:32:40.4817935Z 2025-05-07T20:32:40.4818122Z if scale_ub is not None: 2025-05-07T20:32:40.4818388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4818717Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4819026Z ) 2025-05-07T20:32:40.4819211Z else: 2025-05-07T20:32:40.4819416Z scale_ub_tensor = None 2025-05-07T20:32:40.4819660Z 2025-05-07T20:32:40.4819881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4820188Z op = silu_mul_quant 2025-05-07T20:32:40.4820428Z if compiled: 2025-05-07T20:32:40.4820665Z op = torch.compile(op) 2025-05-07T20:32:40.4820959Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4821233Z 2025-05-07T20:32:40.4821418Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4821585Z 2025-05-07T20:32:40.4821680Z moe/activation_test.py:117: 2025-05-07T20:32:40.4821963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4822381Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4822653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4823219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.4823762Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.4824408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.4825093Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4825628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4826288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4826943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4827532Z kernel = self.compile( 2025-05-07T20:32:40.4828094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4828742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4829127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4829356Z 2025-05-07T20:32:40.4829562Z self = 2025-05-07T20:32:40.4830617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4831977Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca94e0>} 2025-05-07T20:32:40.4833408Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4834423Z context = 2025-05-07T20:32:40.4834702Z 2025-05-07T20:32:40.4834870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4835379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4835849Z module_map=module_map) 2025-05-07T20:32:40.4836205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4836561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4836859Z E ^ 2025-05-07T20:32:40.4837314Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.4837753Z 2025-05-07T20:32:40.4838182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4838685Z 2025-05-07T20:32:40.4838792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4839201Z self=, 2025-05-07T20:32:40.4839589Z T=2048, 2025-05-07T20:32:40.4839763Z D=7168, 2025-05-07T20:32:40.4839942Z scale_ub=1200.0, 2025-05-07T20:32:40.4840599Z contiguous=False, 2025-05-07T20:32:40.4840875Z compiled=False, 2025-05-07T20:32:40.6784083Z ) 2025-05-07T20:32:40.6784743Z self = 2025-05-07T20:32:40.6785485Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.6785867Z 2025-05-07T20:32:40.6785970Z @given( 2025-05-07T20:32:40.6786282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6786711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6787245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6787674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6787993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6788279Z ) 2025-05-07T20:32:40.6788630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6789092Z def test_silu_mul_quant( 2025-05-07T20:32:40.6789322Z self, 2025-05-07T20:32:40.6789515Z T: int, 2025-05-07T20:32:40.6789706Z D: int, 2025-05-07T20:32:40.6789913Z scale_ub: Optional[float], 2025-05-07T20:32:40.6790174Z contiguous: bool, 2025-05-07T20:32:40.6790405Z compiled: bool, 2025-05-07T20:32:40.6790619Z ) -> None: 2025-05-07T20:32:40.6790830Z torch.manual_seed(2025) 2025-05-07T20:32:40.6791071Z 2025-05-07T20:32:40.6791339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6791682Z 2025-05-07T20:32:40.6791866Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6792143Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6792450Z x = x_sign * x_clamp 2025-05-07T20:32:40.6792681Z x0 = x[:, :D] 2025-05-07T20:32:40.6792886Z x1 = x[:, D:] 2025-05-07T20:32:40.6793083Z 2025-05-07T20:32:40.6793268Z if contiguous: 2025-05-07T20:32:40.6793496Z x0 = x0.contiguous() 2025-05-07T20:32:40.6793741Z x1 = x1.contiguous() 2025-05-07T20:32:40.6793978Z 2025-05-07T20:32:40.6794160Z if scale_ub is not None: 2025-05-07T20:32:40.6794421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6794749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6795042Z ) 2025-05-07T20:32:40.6795219Z else: 2025-05-07T20:32:40.6795423Z scale_ub_tensor = None 2025-05-07T20:32:40.6795667Z 2025-05-07T20:32:40.6795885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6796195Z op = silu_mul_quant 2025-05-07T20:32:40.6796442Z if compiled: 2025-05-07T20:32:40.6796839Z op = torch.compile(op) 2025-05-07T20:32:40.6797138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6797403Z 2025-05-07T20:32:40.6797585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6797748Z 2025-05-07T20:32:40.6797843Z moe/activation_test.py:117: 2025-05-07T20:32:40.6798125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6798443Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6798705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6799381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:40.6800057Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6800580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6801244Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6801899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6802415Z kernel = self.compile( 2025-05-07T20:32:40.6802957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6803591Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6803980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6804202Z 2025-05-07T20:32:40.6804404Z self = 2025-05-07T20:32:40.6805457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6806945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca9f80>} 2025-05-07T20:32:40.6808296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6809289Z context = 2025-05-07T20:32:40.6809569Z 2025-05-07T20:32:40.6809734Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6810239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6810691Z module_map=module_map) 2025-05-07T20:32:40.6811044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6811393Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6811644Z E ^ 2025-05-07T20:32:40.6812099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6812537Z 2025-05-07T20:32:40.6812955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6813454Z 2025-05-07T20:32:40.6813554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6813952Z self=, 2025-05-07T20:32:40.6814352Z T=1, 2025-05-07T20:32:40.6814529Z D=7168, 2025-05-07T20:32:40.6814706Z scale_ub=None, 2025-05-07T20:32:40.6814914Z contiguous=True, 2025-05-07T20:32:40.6815128Z compiled=False, 2025-05-07T20:32:40.6815318Z ) 2025-05-07T20:32:40.6815625Z self = 2025-05-07T20:32:40.6816098Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:40.6816356Z 2025-05-07T20:32:40.6816515Z @given( 2025-05-07T20:32:40.6816735Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6817037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6817327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6817643Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6817961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6818233Z ) 2025-05-07T20:32:40.6818565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6818996Z def test_silu_mul_quant( 2025-05-07T20:32:40.6819226Z self, 2025-05-07T20:32:40.6819406Z T: int, 2025-05-07T20:32:40.6819598Z D: int, 2025-05-07T20:32:40.6819803Z scale_ub: Optional[float], 2025-05-07T20:32:40.6820058Z contiguous: bool, 2025-05-07T20:32:40.6820291Z compiled: bool, 2025-05-07T20:32:40.6820502Z ) -> None: 2025-05-07T20:32:40.6820698Z torch.manual_seed(2025) 2025-05-07T20:32:40.6820934Z 2025-05-07T20:32:40.6821200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6821518Z 2025-05-07T20:32:40.6821696Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6821978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6822270Z x = x_sign * x_clamp 2025-05-07T20:32:40.6822499Z x0 = x[:, :D] 2025-05-07T20:32:40.6822700Z x1 = x[:, D:] 2025-05-07T20:32:40.6822903Z 2025-05-07T20:32:40.6823068Z if contiguous: 2025-05-07T20:32:40.6823290Z x0 = x0.contiguous() 2025-05-07T20:32:40.6823543Z x1 = x1.contiguous() 2025-05-07T20:32:40.6823775Z 2025-05-07T20:32:40.6823956Z if scale_ub is not None: 2025-05-07T20:32:40.6824215Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6824623Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6824920Z ) 2025-05-07T20:32:40.6825099Z else: 2025-05-07T20:32:40.6825298Z scale_ub_tensor = None 2025-05-07T20:32:40.6825541Z 2025-05-07T20:32:40.6825763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6826057Z op = silu_mul_quant 2025-05-07T20:32:40.6826299Z if compiled: 2025-05-07T20:32:40.6826539Z op = torch.compile(op) 2025-05-07T20:32:40.6826819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6827088Z 2025-05-07T20:32:40.6827279Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6827486Z 2025-05-07T20:32:40.6827589Z moe/activation_test.py:117: 2025-05-07T20:32:40.6827873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6828190Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6828466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6829144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6829811Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
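The root cause is the hardware rather than the test logic: the linux.g5.4xlarge.nvidia.gpu runner carries an NVIDIA A10G (compute capability sm_86), and Triton does not lower the fp8e4nv (FP8 E4M3) dtype on this architecture, so every Hypothesis example dies at kernel-compile time before any numerics run. A minimal sketch of a capability guard that would skip these cases up front; the helper name supports_fp8e4nv and the (8, 9) threshold are assumptions for illustration, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs compute capability 8.9+
        # (Ada/Hopper). The A10G on this runner reports (8, 6) and is skipped.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class ActivationFp8Tests(unittest.TestCase):
        ...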
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6881496Z 2025-05-07T20:32:40.6881913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8166300Z 2025-05-07T20:32:40.8166664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8167084Z self=, 2025-05-07T20:32:40.8167577Z T=1, 2025-05-07T20:32:40.8167830Z D=7168, 2025-05-07T20:32:40.8168092Z scale_ub=None, 2025-05-07T20:32:40.8168377Z contiguous=False, 2025-05-07T20:32:40.8168687Z compiled=False, 2025-05-07T20:32:40.8168966Z ) 2025-05-07T20:32:40.8169305Z self = 2025-05-07T20:32:40.8169799Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.8170065Z 2025-05-07T20:32:40.8170143Z @given( 2025-05-07T20:32:40.8170362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8170684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8170985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8171321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8171652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8171924Z ) 2025-05-07T20:32:40.8172263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8172705Z def test_silu_mul_quant( 2025-05-07T20:32:40.8172942Z self, 2025-05-07T20:32:40.8173128Z T: int, 2025-05-07T20:32:40.8173314Z D: int, 2025-05-07T20:32:40.8173523Z scale_ub: Optional[float], 2025-05-07T20:32:40.8173789Z contiguous: bool, 2025-05-07T20:32:40.8174026Z compiled: bool, 2025-05-07T20:32:40.8174240Z ) -> None: 2025-05-07T20:32:40.8174628Z torch.manual_seed(2025) 2025-05-07T20:32:40.8174861Z 2025-05-07T20:32:40.8175125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8175468Z 2025-05-07T20:32:40.8175649Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8175942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8176246Z x = x_sign * x_clamp 2025-05-07T20:32:40.8176478Z x0 = x[:, :D] 2025-05-07T20:32:40.8176678Z x1 = x[:, D:] 2025-05-07T20:32:40.8176903Z 2025-05-07T20:32:40.8177104Z if contiguous: 2025-05-07T20:32:40.8177336Z x0 = x0.contiguous() 2025-05-07T20:32:40.8177590Z x1 = x1.contiguous() 2025-05-07T20:32:40.8177825Z 2025-05-07T20:32:40.8178010Z if scale_ub is not None: 2025-05-07T20:32:40.8178283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8178613Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8178911Z ) 2025-05-07T20:32:40.8179093Z else: 2025-05-07T20:32:40.8179298Z scale_ub_tensor = None 2025-05-07T20:32:40.8179535Z 2025-05-07T20:32:40.8179765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8180070Z op = silu_mul_quant 2025-05-07T20:32:40.8180302Z if compiled: 2025-05-07T20:32:40.8180536Z op = torch.compile(op) 2025-05-07T20:32:40.8180820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8181083Z 2025-05-07T20:32:40.8181263Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8181427Z 2025-05-07T20:32:40.8181522Z moe/activation_test.py:117: 2025-05-07T20:32:40.8181807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8182126Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8182396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8183077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8183755Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8184392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8185064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8185730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8186239Z kernel = self.compile( 2025-05-07T20:32:40.8186782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8187424Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8187881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8188099Z 2025-05-07T20:32:40.8188299Z self = 2025-05-07T20:32:40.8189365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8190714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bb8fe0>} 2025-05-07T20:32:40.8192026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8193016Z context = 2025-05-07T20:32:40.8193297Z 2025-05-07T20:32:40.8193457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8193967Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8194519Z module_map=module_map) 2025-05-07T20:32:40.8194873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8195226Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8195478Z E ^ 2025-05-07T20:32:40.8195921Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8196362Z 2025-05-07T20:32:40.8196766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8197267Z 2025-05-07T20:32:40.8197366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8197768Z self=, 2025-05-07T20:32:40.8198160Z T=2048, 2025-05-07T20:32:40.8198340Z D=7168, 2025-05-07T20:32:40.8198526Z scale_ub=None, 2025-05-07T20:32:40.8198737Z contiguous=False, 2025-05-07T20:32:40.8198956Z compiled=True, 2025-05-07T20:32:40.8199152Z ) 2025-05-07T20:32:40.8199460Z self = 2025-05-07T20:32:40.8199938Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.8200211Z 2025-05-07T20:32:40.8200284Z @given( 2025-05-07T20:32:40.8200503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8200799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8201094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8201412Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8201723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8201999Z ) 2025-05-07T20:32:40.8202331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8202776Z def test_silu_mul_quant( 2025-05-07T20:32:40.8203009Z self, 2025-05-07T20:32:40.8203195Z T: int, 2025-05-07T20:32:40.8203384Z D: int, 2025-05-07T20:32:40.8203587Z scale_ub: Optional[float], 2025-05-07T20:32:40.8203933Z contiguous: bool, 2025-05-07T20:32:40.8204167Z compiled: bool, 2025-05-07T20:32:40.8204379Z ) -> None: 2025-05-07T20:32:40.8204590Z torch.manual_seed(2025) 2025-05-07T20:32:40.8204822Z 2025-05-07T20:32:40.8205079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8205410Z 2025-05-07T20:32:40.8205595Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8205867Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8206162Z x = x_sign * x_clamp 2025-05-07T20:32:40.8206391Z x0 = x[:, :D] 2025-05-07T20:32:40.8206595Z x1 = x[:, D:] 2025-05-07T20:32:40.8206799Z 2025-05-07T20:32:40.8206973Z if contiguous: 2025-05-07T20:32:40.8207190Z x0 = x0.contiguous() 2025-05-07T20:32:40.8207443Z x1 = x1.contiguous() 2025-05-07T20:32:40.8207674Z 2025-05-07T20:32:40.8207853Z if scale_ub is not None: 2025-05-07T20:32:40.8208126Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8208453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8208743Z ) 2025-05-07T20:32:40.8208934Z else: 2025-05-07T20:32:40.8209140Z scale_ub_tensor = None 2025-05-07T20:32:40.8209379Z 2025-05-07T20:32:40.8209600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8209902Z op = silu_mul_quant 2025-05-07T20:32:40.8210148Z if compiled: 2025-05-07T20:32:40.8210381Z op = torch.compile(op) 2025-05-07T20:32:40.8210671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8210945Z 2025-05-07T20:32:40.8211121Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8211280Z 2025-05-07T20:32:40.8211372Z moe/activation_test.py:117: 2025-05-07T20:32:40.8211775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8212094Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8212368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8212914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.8213457Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.8214095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8214766Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8215296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8215957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8216607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8217188Z kernel = self.compile( 2025-05-07T20:32:40.8217726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8218357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8218737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8218958Z 2025-05-07T20:32:40.8219159Z self = 2025-05-07T20:32:40.8220213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8221548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bba7a0>} 2025-05-07T20:32:40.8222993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8224001Z context = 2025-05-07T20:32:40.8224288Z 2025-05-07T20:32:40.8224450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8224961Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8225423Z module_map=module_map) 2025-05-07T20:32:40.8225783Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8226131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8226377Z E ^ 2025-05-07T20:32:40.8226866Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8227320Z 2025-05-07T20:32:40.8227786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8228286Z 2025-05-07T20:32:40.8228388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8228786Z self=, 2025-05-07T20:32:40.8229173Z T=4096, 2025-05-07T20:32:40.8229348Z D=7168, 2025-05-07T20:32:40.8229524Z scale_ub=None, 2025-05-07T20:32:40.8229727Z contiguous=False, 2025-05-07T20:32:40.8229942Z compiled=True, 2025-05-07T20:32:41.0462850Z ) 2025-05-07T20:32:41.0463527Z self = 2025-05-07T20:32:41.0464244Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.0464609Z 2025-05-07T20:32:41.0464724Z @given( 2025-05-07T20:32:41.0465043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0465765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0466166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0466618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0467096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0467515Z ) 2025-05-07T20:32:41.0467910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0468362Z def test_silu_mul_quant( 2025-05-07T20:32:41.0468759Z self, 2025-05-07T20:32:41.0468991Z T: int, 2025-05-07T20:32:41.0469182Z D: int, 2025-05-07T20:32:41.0469391Z scale_ub: Optional[float], 2025-05-07T20:32:41.0469662Z contiguous: bool, 2025-05-07T20:32:41.0469891Z compiled: bool, 2025-05-07T20:32:41.0470109Z ) -> None: 2025-05-07T20:32:41.0470330Z torch.manual_seed(2025) 2025-05-07T20:32:41.0470563Z 2025-05-07T20:32:41.0470826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0471166Z 2025-05-07T20:32:41.0471353Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0471635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0471952Z x = x_sign * x_clamp 2025-05-07T20:32:41.0472192Z x0 = x[:, :D] 2025-05-07T20:32:41.0472402Z x1 = x[:, D:] 2025-05-07T20:32:41.0472616Z 2025-05-07T20:32:41.0472793Z if contiguous: 2025-05-07T20:32:41.0473016Z x0 = x0.contiguous() 2025-05-07T20:32:41.0473274Z x1 = x1.contiguous() 2025-05-07T20:32:41.0473515Z 2025-05-07T20:32:41.0473693Z if scale_ub is not None: 2025-05-07T20:32:41.0473956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0474290Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0474595Z ) 2025-05-07T20:32:41.0474781Z else: 2025-05-07T20:32:41.0474986Z scale_ub_tensor = None 2025-05-07T20:32:41.0475231Z 2025-05-07T20:32:41.0475451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0475755Z op = silu_mul_quant 2025-05-07T20:32:41.0476139Z if compiled: 2025-05-07T20:32:41.0476382Z op = torch.compile(op) 2025-05-07T20:32:41.0476680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0476950Z 2025-05-07T20:32:41.0477129Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0477291Z 2025-05-07T20:32:41.0477387Z moe/activation_test.py:117: 2025-05-07T20:32:41.0477670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0477982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0478251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0478800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0479348Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.0479992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0480675Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0481202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0481872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0482535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0483052Z kernel = self.compile( 2025-05-07T20:32:41.0483582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0484220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0484604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0484918Z 2025-05-07T20:32:41.0485119Z self = 2025-05-07T20:32:41.0486186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0487589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bbb4c0>} 2025-05-07T20:32:41.0488951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0489999Z context = 2025-05-07T20:32:41.0490277Z 2025-05-07T20:32:41.0490446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0490966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0491420Z module_map=module_map) 2025-05-07T20:32:41.0491773Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0492117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0492362Z E ^ 2025-05-07T20:32:41.0492810Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0493248Z 2025-05-07T20:32:41.0493656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0494157Z 2025-05-07T20:32:41.0494263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0494666Z self=, 2025-05-07T20:32:41.0495066Z T=16384, 2025-05-07T20:32:41.0495260Z D=5120, 2025-05-07T20:32:41.0495441Z scale_ub=1200.0, 2025-05-07T20:32:41.0495658Z contiguous=False, 2025-05-07T20:32:41.0495881Z compiled=False, 2025-05-07T20:32:41.0496162Z ) 2025-05-07T20:32:41.0496476Z self = 2025-05-07T20:32:41.0496972Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.0497244Z 2025-05-07T20:32:41.0497323Z @given( 2025-05-07T20:32:41.0497540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0497844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0498137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0498447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0498762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0499036Z ) 2025-05-07T20:32:41.0499365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0499817Z def test_silu_mul_quant( 2025-05-07T20:32:41.0500050Z self, 2025-05-07T20:32:41.0500232Z T: int, 2025-05-07T20:32:41.0500422Z D: int, 2025-05-07T20:32:41.0500633Z scale_ub: Optional[float], 2025-05-07T20:32:41.0500892Z contiguous: bool, 2025-05-07T20:32:41.0501132Z compiled: bool, 2025-05-07T20:32:41.0501350Z ) -> None: 2025-05-07T20:32:41.0501555Z torch.manual_seed(2025) 2025-05-07T20:32:41.0501778Z 2025-05-07T20:32:41.0502042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0502378Z 2025-05-07T20:32:41.0502559Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0502840Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0503145Z x = x_sign * x_clamp 2025-05-07T20:32:41.0503372Z x0 = x[:, :D] 2025-05-07T20:32:41.0503580Z x1 = x[:, D:] 2025-05-07T20:32:41.0503774Z 2025-05-07T20:32:41.0503944Z if contiguous: 2025-05-07T20:32:41.0504253Z x0 = x0.contiguous() 2025-05-07T20:32:41.0504502Z x1 = x1.contiguous() 2025-05-07T20:32:41.0504724Z 2025-05-07T20:32:41.0504919Z if scale_ub is not None: 2025-05-07T20:32:41.0505179Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0505505Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0505808Z ) 2025-05-07T20:32:41.0505987Z else: 2025-05-07T20:32:41.0506190Z scale_ub_tensor = None 2025-05-07T20:32:41.0506432Z 2025-05-07T20:32:41.0506654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0506967Z op = silu_mul_quant 2025-05-07T20:32:41.0514426Z if compiled: 2025-05-07T20:32:41.0514689Z op = torch.compile(op) 2025-05-07T20:32:41.0514984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0515247Z 2025-05-07T20:32:41.0515441Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0515622Z 2025-05-07T20:32:41.0515719Z moe/activation_test.py:117: 2025-05-07T20:32:41.0516019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0516345Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0516621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0517356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:41.0518031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0518571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0519237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0519903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0520422Z kernel = self.compile( 2025-05-07T20:32:41.0520959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0521705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0522092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0522313Z 2025-05-07T20:32:41.0522520Z self = 2025-05-07T20:32:41.0523576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0524925Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1c860>} 2025-05-07T20:32:41.0526234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0527239Z context = 2025-05-07T20:32:41.0527518Z 2025-05-07T20:32:41.0527682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0528182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0528637Z module_map=module_map) 2025-05-07T20:32:41.0528988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0529333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0529581Z E ^ 2025-05-07T20:32:41.0530028Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0530466Z 2025-05-07T20:32:41.0530902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0531487Z 2025-05-07T20:32:41.0531587Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0531989Z self=, 2025-05-07T20:32:41.0532378Z T=16384, 2025-05-07T20:32:41.0532568Z D=5120, 2025-05-07T20:32:41.0532751Z scale_ub=1200.0, 2025-05-07T20:32:41.0532967Z contiguous=True, 2025-05-07T20:32:41.0533177Z compiled=True, 2025-05-07T20:32:41.0533369Z ) 2025-05-07T20:32:41.0533682Z self = 2025-05-07T20:32:41.0534164Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.0534437Z 2025-05-07T20:32:41.0534508Z @given( 2025-05-07T20:32:41.0534729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0535029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0535322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0535640Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0535961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0536240Z ) 2025-05-07T20:32:41.0536572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0537017Z def test_silu_mul_quant( 2025-05-07T20:32:41.0537249Z self, 2025-05-07T20:32:41.0537431Z T: int, 2025-05-07T20:32:41.0537619Z D: int, 2025-05-07T20:32:41.0537827Z scale_ub: Optional[float], 2025-05-07T20:32:41.0538084Z contiguous: bool, 2025-05-07T20:32:41.0538314Z compiled: bool, 2025-05-07T20:32:41.0538530Z ) -> None: 2025-05-07T20:32:41.0538728Z torch.manual_seed(2025) 2025-05-07T20:32:41.0538966Z 2025-05-07T20:32:41.0539226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0539562Z 2025-05-07T20:32:41.0539740Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0540021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0540634Z x = x_sign * x_clamp 2025-05-07T20:32:41.0541006Z x0 = x[:, :D] 2025-05-07T20:32:41.0541221Z x1 = x[:, D:] 2025-05-07T20:32:41.0541420Z 2025-05-07T20:32:41.0541593Z if contiguous: 2025-05-07T20:32:41.0541811Z x0 = x0.contiguous() 2025-05-07T20:32:41.0542056Z x1 = x1.contiguous() 2025-05-07T20:32:41.0542286Z 2025-05-07T20:32:41.0542468Z if scale_ub is not None: 2025-05-07T20:32:41.0542729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0543046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0543349Z ) 2025-05-07T20:32:41.0543534Z else: 2025-05-07T20:32:41.0543728Z scale_ub_tensor = None 2025-05-07T20:32:41.0543973Z 2025-05-07T20:32:41.0544195Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0544499Z op = silu_mul_quant 2025-05-07T20:32:41.0544735Z if compiled: 2025-05-07T20:32:41.0544973Z op = torch.compile(op) 2025-05-07T20:32:41.0545260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0545518Z 2025-05-07T20:32:41.0545702Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0545860Z 2025-05-07T20:32:41.0545960Z moe/activation_test.py:117: 2025-05-07T20:32:41.0546236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0546550Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0546816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0547355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0548052Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.0548829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0549648Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0550177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0550836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0551503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0552016Z kernel = self.compile( 2025-05-07T20:32:41.0552562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0553205Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0553591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0553810Z 2025-05-07T20:32:41.0554015Z self = 2025-05-07T20:32:41.0555083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0556432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1db20>} 2025-05-07T20:32:41.0557799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0558802Z context = 2025-05-07T20:32:41.0559078Z 2025-05-07T20:32:41.0559237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0559750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0560211Z module_map=module_map) 2025-05-07T20:32:41.0560650Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0560996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0561242Z E ^ 2025-05-07T20:32:41.0561693Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0562132Z 2025-05-07T20:32:41.0562557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2083364Z 2025-05-07T20:32:41.2083685Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2084101Z self=, 2025-05-07T20:32:41.2084559Z T=16384, 2025-05-07T20:32:41.2084755Z D=5120, 2025-05-07T20:32:41.2085031Z scale_ub=None, 2025-05-07T20:32:41.2085324Z contiguous=False, 2025-05-07T20:32:41.2085655Z compiled=True, 2025-05-07T20:32:41.2085944Z ) 2025-05-07T20:32:41.2086295Z self = 2025-05-07T20:32:41.2086802Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.2087072Z 2025-05-07T20:32:41.2087157Z @given( 2025-05-07T20:32:41.2087379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2087691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2087993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2088334Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2088647Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2088929Z ) 2025-05-07T20:32:41.2089272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2089701Z def test_silu_mul_quant( 2025-05-07T20:32:41.2089944Z self, 2025-05-07T20:32:41.2090315Z T: int, 2025-05-07T20:32:41.2090511Z D: int, 2025-05-07T20:32:41.2090734Z scale_ub: Optional[float], 2025-05-07T20:32:41.2090999Z contiguous: bool, 2025-05-07T20:32:41.2091240Z compiled: bool, 2025-05-07T20:32:41.2091465Z ) -> None: 2025-05-07T20:32:41.2091683Z torch.manual_seed(2025) 2025-05-07T20:32:41.2091918Z 2025-05-07T20:32:41.2092188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2092531Z 2025-05-07T20:32:41.2092721Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2093000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2093307Z x = x_sign * x_clamp 2025-05-07T20:32:41.2093553Z x0 = x[:, :D] 2025-05-07T20:32:41.2093757Z x1 = x[:, D:] 2025-05-07T20:32:41.2093961Z 2025-05-07T20:32:41.2094140Z if contiguous: 2025-05-07T20:32:41.2094363Z x0 = x0.contiguous() 2025-05-07T20:32:41.2094621Z x1 = x1.contiguous() 2025-05-07T20:32:41.2094861Z 2025-05-07T20:32:41.2095042Z if scale_ub is not None: 2025-05-07T20:32:41.2095314Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2095651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2095949Z ) 2025-05-07T20:32:41.2096138Z else: 2025-05-07T20:32:41.2096342Z scale_ub_tensor = None 2025-05-07T20:32:41.2096577Z 2025-05-07T20:32:41.2096823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2097152Z op = silu_mul_quant 2025-05-07T20:32:41.2097398Z if compiled: 2025-05-07T20:32:41.2097635Z op = torch.compile(op) 2025-05-07T20:32:41.2097923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2098190Z 2025-05-07T20:32:41.2098367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2098527Z 2025-05-07T20:32:41.2098622Z moe/activation_test.py:117: 2025-05-07T20:32:41.2098910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2099227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2099501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2100184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2100736Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2101387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2102060Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2102592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2103252Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2103922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2104457Z kernel = self.compile( 2025-05-07T20:32:41.2104988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2105634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2106017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2106237Z 2025-05-07T20:32:41.2106441Z self = 2025-05-07T20:32:41.2107581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2108952Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1e8e0>} 2025-05-07T20:32:41.2110386Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2111387Z context = 2025-05-07T20:32:41.2111668Z 2025-05-07T20:32:41.2111833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2112342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2112820Z module_map=module_map) 2025-05-07T20:32:41.2113179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2113522Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2113778Z E ^ 2025-05-07T20:32:41.2114229Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2114675Z 2025-05-07T20:32:41.2115096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2115598Z 2025-05-07T20:32:41.2115701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2116110Z self=, 2025-05-07T20:32:41.2116512Z T=2048, 2025-05-07T20:32:41.2116689Z D=5120, 2025-05-07T20:32:41.2116875Z scale_ub=None, 2025-05-07T20:32:41.2117080Z contiguous=False, 2025-05-07T20:32:41.2117291Z compiled=True, 2025-05-07T20:32:41.2117483Z ) 2025-05-07T20:32:41.2117788Z self = 2025-05-07T20:32:41.2118272Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.2118532Z 2025-05-07T20:32:41.2118610Z @given( 2025-05-07T20:32:41.2118831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2119132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2119436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2119757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2120162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2120437Z ) 2025-05-07T20:32:41.2120786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2121226Z def test_silu_mul_quant( 2025-05-07T20:32:41.2121466Z self, 2025-05-07T20:32:41.2121647Z T: int, 2025-05-07T20:32:41.2121834Z D: int, 2025-05-07T20:32:41.2122053Z scale_ub: Optional[float], 2025-05-07T20:32:41.2122310Z contiguous: bool, 2025-05-07T20:32:41.2122542Z compiled: bool, 2025-05-07T20:32:41.2122754Z ) -> None: 2025-05-07T20:32:41.2122962Z torch.manual_seed(2025) 2025-05-07T20:32:41.2123202Z 2025-05-07T20:32:41.2123471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2123798Z 2025-05-07T20:32:41.2123985Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2124271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2124575Z x = x_sign * x_clamp 2025-05-07T20:32:41.2124820Z x0 = x[:, :D] 2025-05-07T20:32:41.2125032Z x1 = x[:, D:] 2025-05-07T20:32:41.2125231Z 2025-05-07T20:32:41.2125415Z if contiguous: 2025-05-07T20:32:41.2125643Z x0 = x0.contiguous() 2025-05-07T20:32:41.2125889Z x1 = x1.contiguous() 2025-05-07T20:32:41.2126120Z 2025-05-07T20:32:41.2126309Z if scale_ub is not None: 2025-05-07T20:32:41.2126578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2126950Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2127248Z ) 2025-05-07T20:32:41.2127433Z else: 2025-05-07T20:32:41.2127638Z scale_ub_tensor = None 2025-05-07T20:32:41.2127882Z 2025-05-07T20:32:41.2128109Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2128501Z op = silu_mul_quant 2025-05-07T20:32:41.2128744Z if compiled: 2025-05-07T20:32:41.2128991Z op = torch.compile(op) 2025-05-07T20:32:41.2129272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2129538Z 2025-05-07T20:32:41.2129724Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2129883Z 2025-05-07T20:32:41.2129979Z moe/activation_test.py:117: 2025-05-07T20:32:41.2130261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2130584Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2130858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2131403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2131956Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2132616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2133292Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2133831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2134509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2135160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2135669Z kernel = self.compile( 2025-05-07T20:32:41.2136200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2136833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2137219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2137440Z 2025-05-07T20:32:41.2137639Z self = 2025-05-07T20:32:41.2138785Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2140374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4040>} 2025-05-07T20:32:41.2141705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2142705Z context = 2025-05-07T20:32:41.2142982Z 2025-05-07T20:32:41.2143140Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2143651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2144115Z module_map=module_map) 2025-05-07T20:32:41.2144467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2144815Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2145056Z E ^ 2025-05-07T20:32:41.2145502Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2145939Z 2025-05-07T20:32:41.2146368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3728778Z 2025-05-07T20:32:41.3728923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3729337Z self=, 2025-05-07T20:32:41.3729827Z T=2048, 2025-05-07T20:32:41.3730085Z D=5120, 2025-05-07T20:32:41.3730344Z scale_ub=1200.0, 2025-05-07T20:32:41.3730849Z contiguous=False, 2025-05-07T20:32:41.3731145Z compiled=True, 2025-05-07T20:32:41.3731421Z ) 2025-05-07T20:32:41.3731853Z self = 2025-05-07T20:32:41.3732457Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.3732725Z 2025-05-07T20:32:41.3732800Z @given( 2025-05-07T20:32:41.3733026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3733329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3733621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3733942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3734255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3734526Z ) 2025-05-07T20:32:41.3734874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3735309Z def test_silu_mul_quant( 2025-05-07T20:32:41.3735542Z self, 2025-05-07T20:32:41.3735738Z T: int, 2025-05-07T20:32:41.3735927Z D: int, 2025-05-07T20:32:41.3736140Z scale_ub: Optional[float], 2025-05-07T20:32:41.3736406Z contiguous: bool, 2025-05-07T20:32:41.3736638Z compiled: bool, 2025-05-07T20:32:41.3736863Z ) -> None: 2025-05-07T20:32:41.3737105Z torch.manual_seed(2025) 2025-05-07T20:32:41.3737347Z 2025-05-07T20:32:41.3737609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3737936Z 2025-05-07T20:32:41.3738123Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3738427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3738723Z x = x_sign * x_clamp 2025-05-07T20:32:41.3738974Z x0 = x[:, :D] 2025-05-07T20:32:41.3739233Z x1 = x[:, D:] 2025-05-07T20:32:41.3739443Z 2025-05-07T20:32:41.3739629Z if contiguous: 2025-05-07T20:32:41.3739856Z x0 = x0.contiguous() 2025-05-07T20:32:41.3740390Z x1 = x1.contiguous() 2025-05-07T20:32:41.3740629Z 2025-05-07T20:32:41.3740807Z if scale_ub is not None: 2025-05-07T20:32:41.3741217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3741552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3741842Z ) 2025-05-07T20:32:41.3742044Z else: 2025-05-07T20:32:41.3742248Z scale_ub_tensor = None 2025-05-07T20:32:41.3742479Z 2025-05-07T20:32:41.3742705Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3743008Z op = silu_mul_quant 2025-05-07T20:32:41.3743248Z if compiled: 2025-05-07T20:32:41.3743490Z op = torch.compile(op) 2025-05-07T20:32:41.3743780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3744044Z 2025-05-07T20:32:41.3744223Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.3744388Z 2025-05-07T20:32:41.3744483Z moe/activation_test.py:117: 2025-05-07T20:32:41.3744774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3745091Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.3745373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3745938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.3746484Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.3747137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.3747875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.3748396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3749064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3749715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3750361Z kernel = self.compile( 2025-05-07T20:32:41.3750904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3751537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3751937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3752157Z 2025-05-07T20:32:41.3752368Z self = 2025-05-07T20:32:41.3753434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3754783Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4e00>} 2025-05-07T20:32:41.3756109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3757165Z context = 2025-05-07T20:32:41.3757447Z 2025-05-07T20:32:41.3757613Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3758118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3758586Z module_map=module_map) 2025-05-07T20:32:41.3758954Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3759299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.3759549Z E ^ 2025-05-07T20:32:41.3760004Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3760449Z 2025-05-07T20:32:41.3760976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3761480Z 2025-05-07T20:32:41.3761578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3761981Z self=, 2025-05-07T20:32:41.3762372Z T=4096, 2025-05-07T20:32:41.3762556Z D=5120, 2025-05-07T20:32:41.3762734Z scale_ub=1200.0, 2025-05-07T20:32:41.3762944Z contiguous=True, 2025-05-07T20:32:41.3763171Z compiled=True, 2025-05-07T20:32:41.3763365Z ) 2025-05-07T20:32:41.3763674Z self = 2025-05-07T20:32:41.3764159Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.3764421Z 2025-05-07T20:32:41.3764494Z @given( 2025-05-07T20:32:41.3764716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3765023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3765318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3773081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3773433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3773731Z ) 2025-05-07T20:32:41.3774078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3774527Z def test_silu_mul_quant( 2025-05-07T20:32:41.3774766Z self, 2025-05-07T20:32:41.3774950Z T: int, 2025-05-07T20:32:41.3775130Z D: int, 2025-05-07T20:32:41.3775338Z scale_ub: Optional[float], 2025-05-07T20:32:41.3775606Z contiguous: bool, 2025-05-07T20:32:41.3775833Z compiled: bool, 2025-05-07T20:32:41.3776038Z ) -> None: 2025-05-07T20:32:41.3776246Z torch.manual_seed(2025) 2025-05-07T20:32:41.3776479Z 2025-05-07T20:32:41.3776733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3777178Z 2025-05-07T20:32:41.3777357Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3777641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3777932Z x = x_sign * x_clamp 2025-05-07T20:32:41.3778153Z x0 = x[:, :D] 2025-05-07T20:32:41.3778355Z x1 = x[:, D:] 2025-05-07T20:32:41.3778543Z 2025-05-07T20:32:41.3778715Z if contiguous: 2025-05-07T20:32:41.3778934Z x0 = x0.contiguous() 2025-05-07T20:32:41.3779169Z x1 = x1.contiguous() 2025-05-07T20:32:41.3779396Z 2025-05-07T20:32:41.3779575Z if scale_ub is not None: 2025-05-07T20:32:41.3779836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3780159Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3780449Z ) 2025-05-07T20:32:41.3780634Z else: 2025-05-07T20:32:41.3780833Z scale_ub_tensor = None 2025-05-07T20:32:41.3781085Z 2025-05-07T20:32:41.3781301Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3781597Z op = silu_mul_quant 2025-05-07T20:32:41.3781839Z if compiled: 2025-05-07T20:32:41.3782076Z op = torch.compile(op) 2025-05-07T20:32:41.3782354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3782617Z 2025-05-07T20:32:41.3782796Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.3782956Z 2025-05-07T20:32:41.3783050Z moe/activation_test.py:117: 2025-05-07T20:32:41.3783331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3783647Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.3783913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3784461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.3785011Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.3785668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.3786427Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.3786952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3787685Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3788327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3788842Z kernel = self.compile( 2025-05-07T20:32:41.3789376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3790036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3790422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3790654Z 2025-05-07T20:32:41.3790854Z self = 2025-05-07T20:32:41.3791911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3793257Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c60c0>} 2025-05-07T20:32:41.3794568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3795565Z context = 2025-05-07T20:32:41.3795843Z 2025-05-07T20:32:41.3795999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3796583Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3797034Z module_map=module_map) 2025-05-07T20:32:41.3797388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3797736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.3797979Z E ^ 2025-05-07T20:32:41.3798419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.3799271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.5469673Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f13c48c72e0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:41.5501220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
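Every CompilationError in this run is the same architecture mismatch: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on GPUs with compute capability 8.9 or higher, while the g5.4xlarge runner's NVIDIA A10G reports capability 8.6, where Triton offers only fp8e4b15 and fp8e5. Note that compiled=False examples fail the same way, since the eager path still launches the Triton kernel at moe/activation.py:80. A guard along these lines would let such a test skip cleanly instead of erroring; this is a sketch, and `_has_fp8e4nv`, the (8, 9) cutoff, and the class name are assumptions, not part of the FBGEMM test suite:

```python
# Sketch: skip fp8e4nv-dependent tests on GPUs where Triton cannot compile
# that dtype. The (8, 9) cutoff (Ada/Hopper) is an assumption consistent
# with the sm_86 failure above, not an official FBGEMM or Triton API.
import unittest

import torch


def _has_fp8e4nv() -> bool:
    # Best-effort capability probe; returns False on CPU-only hosts.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_has_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationFp8Tests(unittest.TestCase):
    ...
```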
Hypothesis went on to try six more examples; each failed with this same CompilationError at the same location (the kernel launch in moe/activation.py:80), with tracebacks identical to the one above. Condensed:
2025-05-07T20:32:41.5501825Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.6681207Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.6712459Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.6750357Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.8364853Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.8395861Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError: fp8e4nv not supported
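The failure can be reproduced without the FBGEMM kernel at all. The sketch below is untested and assumes a recent Triton 3.x on a pre-sm_89 CUDA device; as in the tracebacks above, the error should surface at compile time rather than at runtime (the "at 1:0: def ..." location points at the kernel signature, where the fp8e4nv-typed value is rejected during lowering):

```python
# Untested repro sketch: compiling any kernel that produces an fp8e4nv
# value should raise the same CompilationError on a compute-8.6 GPU,
# and succeed on sm_89/sm_90 parts.
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr):
    x = tl.load(x_ptr)                    # scalar fp32 load
    tl.store(y_ptr, x.to(tl.float8e4nv))  # cast rejected on pre-sm_89 GPUs


x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(1,)](x, y)  # CompilationError on sm_86; fine on sm_89+
```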
2025-05-07T20:32:41.9630978Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [test body as listed above]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Four further examples failed the same way while building their inputs, and one more hit the fp8e4nv CompilationError. Condensed:
2025-05-07T20:32:41.9644202Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
2025-05-07T20:32:41.9657349Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:42.0895810Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB)
2025-05-07T20:32:42.0914984Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 56.00 MiB)
2025-05-07T20:32:42.0927476Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError: fp8e4nv not supported
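The OutOfMemoryError sizes are consistent with the test's own tensors: each failing statement materializes one new [T, 2*D] bfloat16 tensor (2 bytes per element), and the reported request sizes match that exactly. A quick check (`bf16_mib` is an illustrative helper, not from the test suite):

```python
# The "Tried to allocate" sizes above are exactly one [T, 2*D] bf16 tensor.
def bf16_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20  # elements * 2 bytes, in MiB


assert bf16_mib(16384, 5120) == 320.0  # torch.abs(x) intermediate, T=16384, D=5120
assert bf16_mib(4096, 7168) == 112.0   # torch.abs(x) intermediate, T=4096,  D=7168
assert bf16_mib(16384, 7168) == 448.0  # torch.randn input,         T=16384, D=7168
assert bf16_mib(2048, 7168) == 56.0    # per-op tensor,             T=2048,  D=7168
```

The requests fail despite being small because 21.9 to 22.0 GiB of the A10G's 22.07 GiB is already in use, i.e. tensors and cached blocks from earlier Hypothesis examples (up to 16384 x 14336) are still held. Beyond the log's own suggestion of PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, one conventional mitigation, shown here only as a sketch (`run_isolated` is hypothetical, not what this suite does), is to release dead tensors and cached blocks between examples:

```python
# Sketch: free cached CUDA blocks between Hypothesis examples so one
# 16384 x 14336 example cannot starve the next one.
import gc

import torch


def run_isolated(test_fn, *args, **kwargs):
    try:
        return test_fn(*args, **kwargs)
    finally:
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached blocks to the driver
```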
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0944620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0945285Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0945937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0946456Z kernel = self.compile( 2025-05-07T20:32:42.0946988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0947681Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0948063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0948413Z 2025-05-07T20:32:42.0948612Z self = 2025-05-07T20:32:42.0949695Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0951038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4528a40>} 2025-05-07T20:32:42.0952352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0953350Z context = 2025-05-07T20:32:42.0953629Z 2025-05-07T20:32:42.0953790Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0954305Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0954767Z module_map=module_map) 2025-05-07T20:32:42.0955115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0955466Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0955719Z E ^ 2025-05-07T20:32:42.0956165Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0956600Z 2025-05-07T20:32:42.0957003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0957517Z 2025-05-07T20:32:42.0957616Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0958015Z self=, 2025-05-07T20:32:42.0958417Z T=128, 2025-05-07T20:32:42.0958597Z D=5120, 2025-05-07T20:32:42.0958781Z scale_ub=None, 2025-05-07T20:32:42.0958991Z contiguous=True, 2025-05-07T20:32:42.0959197Z compiled=False, 2025-05-07T20:32:42.0959511Z ) 2025-05-07T20:32:42.0959822Z self = 2025-05-07T20:32:42.0960294Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.0960560Z 2025-05-07T20:32:42.0960636Z @given( 2025-05-07T20:32:42.0960857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0961156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0961459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0961776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0962090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0962361Z ) 2025-05-07T20:32:42.0962729Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0963186Z def test_silu_mul_quant( 2025-05-07T20:32:42.0963419Z self, 2025-05-07T20:32:42.0963598Z T: int, 2025-05-07T20:32:42.0963784Z D: int, 2025-05-07T20:32:42.0964002Z scale_ub: Optional[float], 2025-05-07T20:32:42.0964260Z contiguous: bool, 2025-05-07T20:32:42.0964493Z compiled: bool, 2025-05-07T20:32:42.0964708Z ) -> None: 2025-05-07T20:32:42.0964908Z torch.manual_seed(2025) 2025-05-07T20:32:42.0965140Z 2025-05-07T20:32:42.0965406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0965736Z 2025-05-07T20:32:42.0965919Z x_sign = torch.sign(x) 2025-05-07T20:32:42.0966202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.0966495Z x = x_sign * x_clamp 2025-05-07T20:32:42.0966726Z x0 = x[:, :D] 2025-05-07T20:32:42.0966948Z x1 = x[:, D:] 2025-05-07T20:32:42.0967179Z 2025-05-07T20:32:42.0967370Z if contiguous: 2025-05-07T20:32:42.0967709Z x0 = x0.contiguous() 2025-05-07T20:32:42.0967955Z x1 = x1.contiguous() 2025-05-07T20:32:42.0968183Z 2025-05-07T20:32:42.0968368Z if scale_ub is not None: 2025-05-07T20:32:42.0968626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.0968946Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.0969235Z ) 2025-05-07T20:32:42.0969419Z else: 2025-05-07T20:32:42.0969613Z scale_ub_tensor = None 2025-05-07T20:32:42.0969851Z 2025-05-07T20:32:42.0970074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.0970369Z op = silu_mul_quant 2025-05-07T20:32:42.0970608Z if compiled: 2025-05-07T20:32:42.0970840Z op = torch.compile(op) 2025-05-07T20:32:42.0971121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0971382Z 2025-05-07T20:32:42.0971566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.0971733Z 2025-05-07T20:32:42.0971827Z moe/activation_test.py:117: 2025-05-07T20:32:42.0972110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0972431Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.0972700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0973369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.0974037Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0974563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0975221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0975869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0976391Z kernel = self.compile( 2025-05-07T20:32:42.0976952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0977668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0978056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0978278Z 2025-05-07T20:32:42.0978485Z self = 2025-05-07T20:32:42.0979539Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0980875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4529940>} 2025-05-07T20:32:42.0982181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0983191Z context = 2025-05-07T20:32:42.0983470Z 2025-05-07T20:32:42.0983637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0984142Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0984594Z module_map=module_map) 2025-05-07T20:32:42.0984947Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0985289Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0985534Z E ^ 2025-05-07T20:32:42.0985979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0986417Z 2025-05-07T20:32:42.0986844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2098699Z 2025-05-07T20:32:42.2099165Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2099846Z self=, 2025-05-07T20:32:42.2100382Z T=128, 2025-05-07T20:32:42.2100634Z D=7168, 2025-05-07T20:32:42.2100880Z scale_ub=None, 2025-05-07T20:32:42.2101144Z contiguous=True, 2025-05-07T20:32:42.2101429Z compiled=False, 2025-05-07T20:32:42.2101699Z ) 2025-05-07T20:32:42.2102077Z self = 2025-05-07T20:32:42.2102562Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2102840Z 2025-05-07T20:32:42.2102920Z @given( 2025-05-07T20:32:42.2103150Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2103452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2103761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2104094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2104410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2104688Z ) 2025-05-07T20:32:42.2105026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2105464Z def test_silu_mul_quant( 2025-05-07T20:32:42.2105728Z self, 2025-05-07T20:32:42.2105911Z T: int, 2025-05-07T20:32:42.2106105Z D: int, 2025-05-07T20:32:42.2106335Z scale_ub: Optional[float], 2025-05-07T20:32:42.2106604Z contiguous: bool, 2025-05-07T20:32:42.2106841Z compiled: bool, 2025-05-07T20:32:42.2107070Z ) -> None: 2025-05-07T20:32:42.2107276Z torch.manual_seed(2025) 2025-05-07T20:32:42.2107589Z 2025-05-07T20:32:42.2107859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2108179Z 2025-05-07T20:32:42.2108355Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2108639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2108932Z x = x_sign * x_clamp 2025-05-07T20:32:42.2109345Z x0 = x[:, :D] 2025-05-07T20:32:42.2109559Z x1 = x[:, D:] 2025-05-07T20:32:42.2109764Z 2025-05-07T20:32:42.2109938Z if contiguous: 2025-05-07T20:32:42.2110161Z x0 = x0.contiguous() 2025-05-07T20:32:42.2110407Z x1 = x1.contiguous() 2025-05-07T20:32:42.2110632Z 2025-05-07T20:32:42.2110819Z if scale_ub is not None: 2025-05-07T20:32:42.2111089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2111414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2111712Z ) 2025-05-07T20:32:42.2111897Z else: 2025-05-07T20:32:42.2112094Z scale_ub_tensor = None 2025-05-07T20:32:42.2112333Z 2025-05-07T20:32:42.2112552Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2112853Z op = silu_mul_quant 2025-05-07T20:32:42.2113090Z if compiled: 2025-05-07T20:32:42.2113328Z op = torch.compile(op) 2025-05-07T20:32:42.2113609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2113871Z 2025-05-07T20:32:42.2114056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2114216Z 2025-05-07T20:32:42.2114316Z moe/activation_test.py:117: 2025-05-07T20:32:42.2114600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2114923Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2115195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2115899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2116570Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2117097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2117899Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2118569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2119085Z kernel = self.compile( 2025-05-07T20:32:42.2119627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2120261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2120646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2120876Z 2025-05-07T20:32:42.2121076Z self = 2025-05-07T20:32:42.2122132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2123497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c452a700>} 2025-05-07T20:32:42.2124807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2125852Z context = 2025-05-07T20:32:42.2126129Z 2025-05-07T20:32:42.2126296Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2126799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2127256Z module_map=module_map) 2025-05-07T20:32:42.2127614Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2127956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2128202Z E ^ 2025-05-07T20:32:42.2128735Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2129173Z 2025-05-07T20:32:42.2129589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2130088Z 2025-05-07T20:32:42.2130194Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2130590Z self=, 2025-05-07T20:32:42.2130982Z T=2048, 2025-05-07T20:32:42.2131157Z D=7168, 2025-05-07T20:32:42.2131333Z scale_ub=1200.0, 2025-05-07T20:32:42.2131546Z contiguous=True, 2025-05-07T20:32:42.2131761Z compiled=False, 2025-05-07T20:32:42.2131949Z ) 2025-05-07T20:32:42.2132255Z self = 2025-05-07T20:32:42.2132744Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.2133003Z 2025-05-07T20:32:42.2133081Z @given( 2025-05-07T20:32:42.2133302Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2133602Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2133897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2134209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2134523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2134795Z ) 2025-05-07T20:32:42.2135126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2135546Z def test_silu_mul_quant( 2025-05-07T20:32:42.2135776Z self, 2025-05-07T20:32:42.2135951Z T: int, 2025-05-07T20:32:42.2136132Z D: int, 2025-05-07T20:32:42.2136339Z scale_ub: Optional[float], 2025-05-07T20:32:42.2136595Z contiguous: bool, 2025-05-07T20:32:42.2136906Z compiled: bool, 2025-05-07T20:32:42.2137119Z ) -> None: 2025-05-07T20:32:42.2137359Z torch.manual_seed(2025) 2025-05-07T20:32:42.2137596Z 2025-05-07T20:32:42.2137859Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2139873Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
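Note on the CompilationError repeated above: Triton rejects the fp8e4nv (float8_e4m3fn) encoding because this GPU only exposes fp8e4b15 and fp8e5, which points at a pre-Ada/Hopper part; fp8e4nv is generally assumed to require compute capability 8.9 or newer. A minimal sketch of a skip guard, assuming unittest-style tests and that the (8, 9) threshold matches the installed Triton (both are assumptions, not taken from this log):

    import unittest
    import torch

    def has_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (e4m3) needs SM 8.9+ (Ada/Hopper); older parts
        # such as Ampere only expose fp8e4b15/fp8e5 in Triton.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(has_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...

With such a guard the FP8 cases would be reported as skips on this runner instead of falsifying every Hypothesis example.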
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2142047Z 2025-05-07T20:32:42.2142159Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2142361Z 2025-05-07T20:32:42.2142472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2142865Z self=, 2025-05-07T20:32:42.2143264Z T=1, 2025-05-07T20:32:42.2143431Z D=5120, 2025-05-07T20:32:42.2143606Z scale_ub=1200.0, 2025-05-07T20:32:42.2143820Z contiguous=True, 2025-05-07T20:32:42.2144038Z compiled=False, 2025-05-07T20:32:42.2144230Z ) 2025-05-07T20:32:42.2144529Z self = 2025-05-07T20:32:42.2144996Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.2145249Z 2025-05-07T20:32:42.2145331Z @given( 2025-05-07T20:32:42.2145541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2145837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2146132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2146445Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2146762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2147037Z ) 2025-05-07T20:32:42.2147581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2148008Z def test_silu_mul_quant( 2025-05-07T20:32:42.2148236Z self, 2025-05-07T20:32:42.2148420Z T: int, 2025-05-07T20:32:42.2148599Z D: int, 2025-05-07T20:32:42.2156536Z scale_ub: Optional[float], 2025-05-07T20:32:42.2156825Z contiguous: bool, 2025-05-07T20:32:42.2157068Z compiled: bool, 2025-05-07T20:32:42.2157283Z ) -> None: 2025-05-07T20:32:42.2157493Z torch.manual_seed(2025) 2025-05-07T20:32:42.2157723Z 2025-05-07T20:32:42.2157981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2158319Z 2025-05-07T20:32:42.2158509Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2158791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2159094Z x = x_sign * x_clamp 2025-05-07T20:32:42.2159323Z x0 = x[:, :D] 2025-05-07T20:32:42.2159526Z x1 = x[:, D:] 2025-05-07T20:32:42.2159728Z 2025-05-07T20:32:42.2159913Z if contiguous: 2025-05-07T20:32:42.2160130Z x0 = x0.contiguous() 2025-05-07T20:32:42.2160375Z x1 = x1.contiguous() 2025-05-07T20:32:42.2160603Z 2025-05-07T20:32:42.2160775Z if scale_ub is not None: 2025-05-07T20:32:42.2161029Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2161352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2161650Z ) 2025-05-07T20:32:42.2161828Z else: 2025-05-07T20:32:42.2162026Z scale_ub_tensor = None 2025-05-07T20:32:42.2162263Z 2025-05-07T20:32:42.2162481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2162782Z op = silu_mul_quant 2025-05-07T20:32:42.2163025Z if compiled: 2025-05-07T20:32:42.2163255Z op = torch.compile(op) 2025-05-07T20:32:42.2163703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2163965Z 2025-05-07T20:32:42.2164149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2164314Z 2025-05-07T20:32:42.2164407Z moe/activation_test.py:117: 2025-05-07T20:32:42.2164694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2165012Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2165281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2165959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2166633Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2167157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2167821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2168479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2168993Z kernel = self.compile( 2025-05-07T20:32:42.2169536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2170211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2170596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2170820Z 2025-05-07T20:32:42.2171027Z self = 2025-05-07T20:32:42.2172083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2173436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c452bce0>} 2025-05-07T20:32:42.2174840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2175894Z context = 2025-05-07T20:32:42.2176174Z 2025-05-07T20:32:42.2176337Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2176848Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2177304Z module_map=module_map) 2025-05-07T20:32:42.2177665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2178008Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2178254Z E ^ 2025-05-07T20:32:42.2178704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2179143Z 2025-05-07T20:32:42.2179554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2983154Z 2025-05-07T20:32:42.2983416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2984033Z self=, 2025-05-07T20:32:42.2984585Z T=2048, 2025-05-07T20:32:42.2984833Z D=5120, 2025-05-07T20:32:42.2985076Z scale_ub=None, 2025-05-07T20:32:42.2985357Z contiguous=True, 2025-05-07T20:32:42.2985653Z compiled=False, 2025-05-07T20:32:42.2985882Z ) 2025-05-07T20:32:42.2986197Z self = 2025-05-07T20:32:42.2986681Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2986946Z 2025-05-07T20:32:42.2987221Z @given( 2025-05-07T20:32:42.2987528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2987831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2988147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2988472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2988786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2989064Z ) 2025-05-07T20:32:42.2989397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2989842Z def test_silu_mul_quant( 2025-05-07T20:32:42.2990068Z self, 2025-05-07T20:32:42.2990249Z T: int, 2025-05-07T20:32:42.2990450Z D: int, 2025-05-07T20:32:42.2990666Z scale_ub: Optional[float], 2025-05-07T20:32:42.2990928Z contiguous: bool, 2025-05-07T20:32:42.2991154Z compiled: bool, 2025-05-07T20:32:42.2991367Z ) -> None: 2025-05-07T20:32:42.2991568Z torch.manual_seed(2025) 2025-05-07T20:32:42.2991834Z 2025-05-07T20:32:42.2992095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2992435Z 2025-05-07T20:32:42.2992626Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.2994534Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2996360Z 2025-05-07T20:32:42.2996474Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.2996680Z 2025-05-07T20:32:42.2996779Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2997193Z self=, 2025-05-07T20:32:42.2997586Z T=16384, 2025-05-07T20:32:42.2997902Z D=5120, 2025-05-07T20:32:42.2998086Z scale_ub=None, 2025-05-07T20:32:42.2998302Z contiguous=True, 2025-05-07T20:32:42.2998512Z compiled=False, 2025-05-07T20:32:42.2998718Z ) 2025-05-07T20:32:42.2999032Z self = 2025-05-07T20:32:42.2999513Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2999792Z 2025-05-07T20:32:42.2999866Z @given( 2025-05-07T20:32:42.3000083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3000385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3000681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3001006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3001331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3001618Z ) 2025-05-07T20:32:42.3001959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3002402Z def test_silu_mul_quant( 2025-05-07T20:32:42.3002633Z self, 2025-05-07T20:32:42.3002818Z T: int, 2025-05-07T20:32:42.3003003Z D: int, 2025-05-07T20:32:42.3003208Z scale_ub: Optional[float], 2025-05-07T20:32:42.3003476Z contiguous: bool, 2025-05-07T20:32:42.3003703Z compiled: bool, 2025-05-07T20:32:42.3003908Z ) -> None: 2025-05-07T20:32:42.3004123Z torch.manual_seed(2025) 2025-05-07T20:32:42.3004367Z 2025-05-07T20:32:42.3004624Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3006639Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
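Note on the OutOfMemoryError runs that follow: only ~30 MiB of the 22.07 GiB device is free while ~21.7 GiB stays allocated by PyTorch, so once the first examples fail, every later Hypothesis example OOMs on its very first allocation. The error message itself suggests the allocator hint; a sketch combining that hint with releasing cached blocks between examples (the setUp placement is an assumption, not something the test currently does):

    import os
    # Exactly the hint printed in the message above; it must be set before the
    # first CUDA allocation, e.g. in the workflow environment.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import unittest
    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical placement
        def setUp(self) -> None:
            torch.cuda.empty_cache()            # return cached blocks to the driver
            torch.cuda.reset_peak_memory_stats()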
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3008630Z 2025-05-07T20:32:42.3008743Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3008950Z 2025-05-07T20:32:42.3009049Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3009461Z self=, 2025-05-07T20:32:42.3009860Z T=4096, 2025-05-07T20:32:42.3010034Z D=5120, 2025-05-07T20:32:42.3010216Z scale_ub=None, 2025-05-07T20:32:42.3010429Z contiguous=True, 2025-05-07T20:32:42.3010654Z compiled=False, 2025-05-07T20:32:42.3010853Z ) 2025-05-07T20:32:42.3011163Z self = 2025-05-07T20:32:42.3011646Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.3011920Z 2025-05-07T20:32:42.3011995Z @given( 2025-05-07T20:32:42.3012226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3012528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3012833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3013152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3013457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3013733Z ) 2025-05-07T20:32:42.3014072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3014512Z def test_silu_mul_quant( 2025-05-07T20:32:42.3014742Z self, 2025-05-07T20:32:42.3014930Z T: int, 2025-05-07T20:32:42.3015116Z D: int, 2025-05-07T20:32:42.3015320Z scale_ub: Optional[float], 2025-05-07T20:32:42.3015576Z contiguous: bool, 2025-05-07T20:32:42.3015819Z compiled: bool, 2025-05-07T20:32:42.3016026Z ) -> None: 2025-05-07T20:32:42.3016231Z torch.manual_seed(2025) 2025-05-07T20:32:42.3016551Z 2025-05-07T20:32:42.3016813Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3018821Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3020733Z 2025-05-07T20:32:42.3020844Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3021057Z 2025-05-07T20:32:42.3021157Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3021559Z self=, 2025-05-07T20:32:42.3021941Z T=2048, 2025-05-07T20:32:42.3022117Z D=5120, 2025-05-07T20:32:42.3022297Z scale_ub=None, 2025-05-07T20:32:42.3022496Z contiguous=False, 2025-05-07T20:32:42.3022710Z compiled=False, 2025-05-07T20:32:42.3022905Z ) 2025-05-07T20:32:42.3023207Z self = 2025-05-07T20:32:42.3023681Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.3023953Z 2025-05-07T20:32:42.3024030Z @given( 2025-05-07T20:32:42.3024242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3024541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3024835Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3025150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3025545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3025826Z ) 2025-05-07T20:32:42.3026172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3026604Z def test_silu_mul_quant( 2025-05-07T20:32:42.3026837Z self, 2025-05-07T20:32:42.3027035Z T: int, 2025-05-07T20:32:42.3027244Z D: int, 2025-05-07T20:32:42.3027535Z scale_ub: Optional[float], 2025-05-07T20:32:42.3027796Z contiguous: bool, 2025-05-07T20:32:42.3028020Z compiled: bool, 2025-05-07T20:32:42.3028233Z ) -> None: 2025-05-07T20:32:42.3028437Z torch.manual_seed(2025) 2025-05-07T20:32:42.3028669Z 2025-05-07T20:32:42.3028924Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3030929Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3032843Z 2025-05-07T20:32:42.3032954Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3033161Z 2025-05-07T20:32:42.3033265Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3033658Z self=, 2025-05-07T20:32:42.3034058Z T=4096, 2025-05-07T20:32:42.3034239Z D=7168, 2025-05-07T20:32:42.3034416Z scale_ub=None, 2025-05-07T20:32:42.3034617Z contiguous=True, 2025-05-07T20:32:42.3034832Z compiled=True, 2025-05-07T20:32:42.3035026Z ) 2025-05-07T20:32:42.3035332Z self = 2025-05-07T20:32:42.3035817Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.3036192Z 2025-05-07T20:32:42.3036275Z @given( 2025-05-07T20:32:42.3036486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3036783Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3037072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3037381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3037689Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3037959Z ) 2025-05-07T20:32:42.3038294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3038736Z def test_silu_mul_quant( 2025-05-07T20:32:42.3038964Z self, 2025-05-07T20:32:42.3039144Z T: int, 2025-05-07T20:32:42.3039337Z D: int, 2025-05-07T20:32:42.3039556Z scale_ub: Optional[float], 2025-05-07T20:32:42.3039807Z contiguous: bool, 2025-05-07T20:32:42.3040040Z compiled: bool, 2025-05-07T20:32:42.3040611Z ) -> None: 2025-05-07T20:32:42.3040831Z torch.manual_seed(2025) 2025-05-07T20:32:42.3041060Z 2025-05-07T20:32:42.3041315Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3043317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3045234Z 2025-05-07T20:32:42.3045500Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3045707Z 2025-05-07T20:32:42.3045805Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3046210Z self=, 2025-05-07T20:32:42.3046607Z T=2048, 2025-05-07T20:32:42.3046794Z D=5120, 2025-05-07T20:32:42.3046975Z scale_ub=1200.0, 2025-05-07T20:32:42.3047189Z contiguous=False, 2025-05-07T20:32:42.3047408Z compiled=False, 2025-05-07T20:32:42.3592192Z ) 2025-05-07T20:32:42.3592584Z self = 2025-05-07T20:32:42.3593256Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.3593584Z 2025-05-07T20:32:42.3593663Z @given( 2025-05-07T20:32:42.3593879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3594179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3594551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3594964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3595434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3595848Z ) 2025-05-07T20:32:42.3596212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3596646Z def test_silu_mul_quant( 2025-05-07T20:32:42.3596885Z self, 2025-05-07T20:32:42.3597078Z T: int, 2025-05-07T20:32:42.3597263Z D: int, 2025-05-07T20:32:42.3597472Z scale_ub: Optional[float], 2025-05-07T20:32:42.3597739Z contiguous: bool, 2025-05-07T20:32:42.3597971Z compiled: bool, 2025-05-07T20:32:42.3598191Z ) -> None: 2025-05-07T20:32:42.3598394Z torch.manual_seed(2025) 2025-05-07T20:32:42.3598630Z 2025-05-07T20:32:42.3598891Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3601065Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3602993Z 2025-05-07T20:32:42.3603109Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3603316Z 2025-05-07T20:32:42.3603422Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3603818Z self=, 2025-05-07T20:32:42.3604224Z T=4096, 2025-05-07T20:32:42.3604410Z D=7168, 2025-05-07T20:32:42.3604596Z scale_ub=1200.0, 2025-05-07T20:32:42.3604817Z contiguous=True, 2025-05-07T20:32:42.3605042Z compiled=False, 2025-05-07T20:32:42.3605230Z ) 2025-05-07T20:32:42.3605538Z self = 2025-05-07T20:32:42.3606016Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3606275Z 2025-05-07T20:32:42.3606350Z @given( 2025-05-07T20:32:42.3606565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3606863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3607160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3607473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3607800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3608083Z ) 2025-05-07T20:32:42.3608410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3608845Z def test_silu_mul_quant( 2025-05-07T20:32:42.3609082Z self, 2025-05-07T20:32:42.3609261Z T: int, 2025-05-07T20:32:42.3609568Z D: int, 2025-05-07T20:32:42.3609779Z scale_ub: Optional[float], 2025-05-07T20:32:42.3610036Z contiguous: bool, 2025-05-07T20:32:42.3610267Z compiled: bool, 2025-05-07T20:32:42.3610475Z ) -> None: 2025-05-07T20:32:42.3610687Z torch.manual_seed(2025) 2025-05-07T20:32:42.3610912Z 2025-05-07T20:32:42.3611170Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3613174Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3615090Z 2025-05-07T20:32:42.3615212Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3615416Z 2025-05-07T20:32:42.3615520Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3615922Z self=, 2025-05-07T20:32:42.3616323Z T=16384, 2025-05-07T20:32:42.3616503Z D=7168, 2025-05-07T20:32:42.3616689Z scale_ub=None, 2025-05-07T20:32:42.3616897Z contiguous=False, 2025-05-07T20:32:42.3617123Z compiled=True, 2025-05-07T20:32:42.3617342Z ) 2025-05-07T20:32:42.3617659Z self = 2025-05-07T20:32:42.3618142Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.3618412Z 2025-05-07T20:32:42.3618488Z @given( 2025-05-07T20:32:42.3618707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3619013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3619310Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3619633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3620040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3620323Z ) 2025-05-07T20:32:42.3620566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3620654Z def test_silu_mul_quant( 2025-05-07T20:32:42.3620732Z self, 2025-05-07T20:32:42.3620806Z T: int, 2025-05-07T20:32:42.3620876Z D: int, 2025-05-07T20:32:42.3620972Z scale_ub: Optional[float], 2025-05-07T20:32:42.3621058Z contiguous: bool, 2025-05-07T20:32:42.3621138Z compiled: bool, 2025-05-07T20:32:42.3621215Z ) -> None: 2025-05-07T20:32:42.3621305Z torch.manual_seed(2025) 2025-05-07T20:32:42.3621376Z 2025-05-07T20:32:42.3621542Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3623310Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3623323Z 2025-05-07T20:32:42.3623435Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3623439Z 2025-05-07T20:32:42.3623534Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3623752Z self=, 2025-05-07T20:32:42.3623824Z T=4096, 2025-05-07T20:32:42.3623895Z D=7168, 2025-05-07T20:32:42.3623986Z scale_ub=None, 2025-05-07T20:32:42.3624152Z contiguous=True, 2025-05-07T20:32:42.3624238Z compiled=False, 2025-05-07T20:32:42.3624319Z ) 2025-05-07T20:32:42.3624538Z self = 2025-05-07T20:32:42.3624702Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.3624713Z 2025-05-07T20:32:42.3624786Z @given( 2025-05-07T20:32:42.3624939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3625041Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3625152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3625271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3625382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3625452Z ) 2025-05-07T20:32:42.3625697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3625785Z def test_silu_mul_quant( 2025-05-07T20:32:42.3625867Z self, 2025-05-07T20:32:42.3625943Z T: int, 2025-05-07T20:32:42.3626016Z D: int, 2025-05-07T20:32:42.3626113Z scale_ub: Optional[float], 2025-05-07T20:32:42.3626201Z contiguous: bool, 2025-05-07T20:32:42.3626282Z compiled: bool, 2025-05-07T20:32:42.3626362Z ) -> None: 2025-05-07T20:32:42.3626451Z torch.manual_seed(2025) 2025-05-07T20:32:42.3626521Z 2025-05-07T20:32:42.3626692Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3628573Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3628585Z 2025-05-07T20:32:42.3628783Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3628788Z 2025-05-07T20:32:42.3628886Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3629102Z self=, 2025-05-07T20:32:42.3629181Z T=16384, 2025-05-07T20:32:42.3629257Z D=7168, 2025-05-07T20:32:42.3629337Z scale_ub=None, 2025-05-07T20:32:42.3629420Z contiguous=True, 2025-05-07T20:32:42.3629502Z compiled=False, 2025-05-07T20:32:42.3629575Z ) 2025-05-07T20:32:42.3629787Z self = 2025-05-07T20:32:42.3629956Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.3629961Z 2025-05-07T20:32:42.3630040Z @given( 2025-05-07T20:32:42.3630154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3630254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3630371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3630488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3630597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3630666Z ) 2025-05-07T20:32:42.3630904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3631001Z def test_silu_mul_quant( 2025-05-07T20:32:42.3631074Z self, 2025-05-07T20:32:42.3631144Z T: int, 2025-05-07T20:32:42.3631223Z D: int, 2025-05-07T20:32:42.3631317Z scale_ub: Optional[float], 2025-05-07T20:32:42.3631400Z contiguous: bool, 2025-05-07T20:32:42.3631485Z compiled: bool, 2025-05-07T20:32:42.3631562Z ) -> None: 2025-05-07T20:32:42.3631655Z torch.manual_seed(2025) 2025-05-07T20:32:42.3631729Z 2025-05-07T20:32:42.3631998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3633747Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3633753Z 2025-05-07T20:32:42.3633865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3633869Z 2025-05-07T20:32:42.3633971Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3634191Z self=, 2025-05-07T20:32:42.3634266Z T=16384, 2025-05-07T20:32:42.3634350Z D=7168, 2025-05-07T20:32:42.3634429Z scale_ub=1200.0, 2025-05-07T20:32:42.3634506Z contiguous=True, 2025-05-07T20:32:42.3634594Z compiled=False, 2025-05-07T20:32:42.3634664Z ) 2025-05-07T20:32:42.3634880Z self = 2025-05-07T20:32:42.3635054Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3635059Z 2025-05-07T20:32:42.3635129Z @given( 2025-05-07T20:32:42.3635246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3635339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3635447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3635566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3635673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3635746Z ) 2025-05-07T20:32:42.3635987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3636081Z def test_silu_mul_quant( 2025-05-07T20:32:42.3636155Z self, 2025-05-07T20:32:42.3636234Z T: int, 2025-05-07T20:32:42.3636700Z D: int, 2025-05-07T20:32:42.3636797Z scale_ub: Optional[float], 2025-05-07T20:32:42.3636886Z contiguous: bool, 2025-05-07T20:32:42.3636973Z compiled: bool, 2025-05-07T20:32:42.3637070Z ) -> None: 2025-05-07T20:32:42.3637167Z torch.manual_seed(2025) 2025-05-07T20:32:42.3637254Z 2025-05-07T20:32:42.3637418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3639176Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3639186Z 2025-05-07T20:32:42.3639300Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5447209Z 2025-05-07T20:32:42.5447562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5448004Z self=, 2025-05-07T20:32:42.5448519Z T=128, 2025-05-07T20:32:42.5448707Z D=5120, 2025-05-07T20:32:42.5448897Z scale_ub=1200.0, 2025-05-07T20:32:42.5449110Z contiguous=False, 2025-05-07T20:32:42.5449339Z compiled=False, 2025-05-07T20:32:42.5449533Z ) 2025-05-07T20:32:42.5449835Z self = 2025-05-07T20:32:42.5450314Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5450582Z 2025-05-07T20:32:42.5450864Z @given( 2025-05-07T20:32:42.5451135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5459454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5459807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5460131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5460453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5460731Z ) 2025-05-07T20:32:42.5461076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5461523Z def test_silu_mul_quant( 2025-05-07T20:32:42.5461760Z self, 2025-05-07T20:32:42.5461942Z T: int, 2025-05-07T20:32:42.5462139Z D: int, 2025-05-07T20:32:42.5462343Z scale_ub: Optional[float], 2025-05-07T20:32:42.5462604Z contiguous: bool, 2025-05-07T20:32:42.5462838Z compiled: bool, 2025-05-07T20:32:42.5463052Z ) -> None: 2025-05-07T20:32:42.5463260Z torch.manual_seed(2025) 2025-05-07T20:32:42.5463499Z 2025-05-07T20:32:42.5463756Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5464089Z 2025-05-07T20:32:42.5464274Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5464552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5464850Z x = x_sign * x_clamp 2025-05-07T20:32:42.5465079Z x0 = x[:, :D] 2025-05-07T20:32:42.5465289Z x1 = x[:, D:] 2025-05-07T20:32:42.5465481Z 2025-05-07T20:32:42.5465654Z if contiguous: 2025-05-07T20:32:42.5465876Z x0 = x0.contiguous() 2025-05-07T20:32:42.5466118Z x1 = x1.contiguous() 2025-05-07T20:32:42.5466344Z 2025-05-07T20:32:42.5466523Z if scale_ub is not None: 2025-05-07T20:32:42.5466783Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5467109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5467403Z ) 2025-05-07T20:32:42.5467659Z else: 2025-05-07T20:32:42.5467863Z scale_ub_tensor = None 2025-05-07T20:32:42.5468103Z 2025-05-07T20:32:42.5468483Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5468791Z op = silu_mul_quant 2025-05-07T20:32:42.5469033Z if compiled: 2025-05-07T20:32:42.5469270Z op = torch.compile(op) 2025-05-07T20:32:42.5469557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5469824Z 2025-05-07T20:32:42.5470004Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5470167Z 2025-05-07T20:32:42.5470262Z moe/activation_test.py:117: 2025-05-07T20:32:42.5470546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5470865Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5471133Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5471810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5472492Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5473021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5473707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5474357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5474873Z kernel = self.compile( 2025-05-07T20:32:42.5475408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5476048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5476433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5476653Z 2025-05-07T20:32:42.5476861Z self = 2025-05-07T20:32:42.5478008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5479358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4483600>} 2025-05-07T20:32:42.5480666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5481665Z context = 2025-05-07T20:32:42.5481942Z 2025-05-07T20:32:42.5482107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5482619Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5483091Z module_map=module_map) 2025-05-07T20:32:42.5483449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5483792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5484041Z E ^ 2025-05-07T20:32:42.5484484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5484922Z 2025-05-07T20:32:42.5485342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5485843Z 2025-05-07T20:32:42.5485942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5486343Z self=, 2025-05-07T20:32:42.5486737Z T=2048, 2025-05-07T20:32:42.5486914Z D=7168, 2025-05-07T20:32:42.5487120Z scale_ub=None, 2025-05-07T20:32:42.5487355Z contiguous=False, 2025-05-07T20:32:42.5487578Z compiled=False, 2025-05-07T20:32:42.5487771Z ) 2025-05-07T20:32:42.5488159Z self = 2025-05-07T20:32:42.5488634Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5488898Z 2025-05-07T20:32:42.5488975Z @given( 2025-05-07T20:32:42.5489189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5489486Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5489774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5490091Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5490404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5490676Z ) 2025-05-07T20:32:42.5491016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5491452Z def test_silu_mul_quant( 2025-05-07T20:32:42.5491679Z self, 2025-05-07T20:32:42.5491863Z T: int, 2025-05-07T20:32:42.5492048Z D: int, 2025-05-07T20:32:42.5492254Z scale_ub: Optional[float], 2025-05-07T20:32:42.5492512Z contiguous: bool, 2025-05-07T20:32:42.5492750Z compiled: bool, 2025-05-07T20:32:42.5492961Z ) -> None: 2025-05-07T20:32:42.5493158Z torch.manual_seed(2025) 2025-05-07T20:32:42.5493390Z 2025-05-07T20:32:42.5493655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5495666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5497578Z 2025-05-07T20:32:42.5497695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5497904Z 2025-05-07T20:32:42.5498008Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5498399Z self=, 2025-05-07T20:32:42.5498794Z T=128, 2025-05-07T20:32:42.5498976Z D=7168, 2025-05-07T20:32:42.5499154Z scale_ub=1200.0, 2025-05-07T20:32:42.5499368Z contiguous=True, 2025-05-07T20:32:42.5499576Z compiled=True, 2025-05-07T20:32:42.5499761Z ) 2025-05-07T20:32:42.5500080Z self = 2025-05-07T20:32:42.5500549Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.5500816Z 2025-05-07T20:32:42.5500895Z @given( 2025-05-07T20:32:42.5501110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5501419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5501708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5502026Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5502337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5502607Z ) 2025-05-07T20:32:42.5502939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5503384Z def test_silu_mul_quant( 2025-05-07T20:32:42.5503610Z self, 2025-05-07T20:32:42.5503788Z T: int, 2025-05-07T20:32:42.5503974Z D: int, 2025-05-07T20:32:42.5504193Z scale_ub: Optional[float], 2025-05-07T20:32:42.5504451Z contiguous: bool, 2025-05-07T20:32:42.5504685Z compiled: bool, 2025-05-07T20:32:42.5504898Z ) -> None: 2025-05-07T20:32:42.5505098Z torch.manual_seed(2025) 2025-05-07T20:32:42.5505328Z 2025-05-07T20:32:42.5505598Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5505925Z 2025-05-07T20:32:42.5506104Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5506466Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5506761Z x = x_sign * x_clamp 2025-05-07T20:32:42.5506985Z x0 = x[:, :D] 2025-05-07T20:32:42.5507190Z x1 = x[:, D:] 2025-05-07T20:32:42.5507388Z 2025-05-07T20:32:42.5507607Z if contiguous: 2025-05-07T20:32:42.5507828Z x0 = x0.contiguous() 2025-05-07T20:32:42.5508080Z x1 = x1.contiguous() 2025-05-07T20:32:42.5508299Z 2025-05-07T20:32:42.5508481Z if scale_ub is not None: 2025-05-07T20:32:42.5508739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5509058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5509354Z ) 2025-05-07T20:32:42.5509548Z else: 2025-05-07T20:32:42.5509742Z scale_ub_tensor = None 2025-05-07T20:32:42.5509981Z 2025-05-07T20:32:42.5510213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5510504Z op = silu_mul_quant 2025-05-07T20:32:42.5510759Z if compiled: 2025-05-07T20:32:42.5511004Z op = torch.compile(op) 2025-05-07T20:32:42.5511289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5511546Z 2025-05-07T20:32:42.5511724Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5511882Z 2025-05-07T20:32:42.5511981Z moe/activation_test.py:117: 2025-05-07T20:32:42.5512263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5512582Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5512851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5513403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5513946Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5514581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5515343Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5515876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5516539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5517218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5517750Z kernel = self.compile( 2025-05-07T20:32:42.5518277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5518908Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5519298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5519520Z 2025-05-07T20:32:42.5519729Z self = 2025-05-07T20:32:42.5520829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5522176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4260900>} 2025-05-07T20:32:42.5523484Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5524484Z context = 2025-05-07T20:32:42.5524765Z 2025-05-07T20:32:42.5524924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5525448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5525986Z module_map=module_map) 2025-05-07T20:32:42.5526341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5526681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5526932Z E ^ 2025-05-07T20:32:42.5527379Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5527819Z 2025-05-07T20:32:42.5528230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1403380Z 2025-05-07T20:32:43.1403981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1404598Z self=, 2025-05-07T20:32:43.1405147Z T=128, 2025-05-07T20:32:43.1405356Z D=7168, 2025-05-07T20:32:43.1405560Z scale_ub=1200.0, 2025-05-07T20:32:43.1405815Z contiguous=True, 2025-05-07T20:32:43.1406031Z compiled=False, 2025-05-07T20:32:43.1406226Z ) 2025-05-07T20:32:43.1406538Z self = 2025-05-07T20:32:43.1407010Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.1407280Z 2025-05-07T20:32:43.1407363Z @given( 2025-05-07T20:32:43.1407583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1407890Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1408185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1408498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1408815Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1409087Z ) 2025-05-07T20:32:43.1409418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1410044Z def test_silu_mul_quant( 2025-05-07T20:32:43.1410278Z self, 2025-05-07T20:32:43.1410478Z T: int, 2025-05-07T20:32:43.1410663Z D: int, 2025-05-07T20:32:43.1410883Z scale_ub: Optional[float], 2025-05-07T20:32:43.1411149Z contiguous: bool, 2025-05-07T20:32:43.1411371Z compiled: bool, 2025-05-07T20:32:43.1411592Z ) -> None: 2025-05-07T20:32:43.1411793Z torch.manual_seed(2025) 2025-05-07T20:32:43.1412025Z 2025-05-07T20:32:43.1412298Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1412645Z 2025-05-07T20:32:43.1412826Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1413113Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1415080Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
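Note: the compiled=True example above fails with the same CompilationError, just with one extra frame in torch/_dynamo/eval_frame.py. torch.compile wraps the op, but the Triton kernel still JIT-compiles at call time, so compilation mode cannot mask the unsupported-dtype error. A sketch (import path taken from the traceback above):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    compiled_op = torch.compile(silu_mul_quant)
    # Calling compiled_op re-enters silu_mul_quant via eval_frame and launches
    # _fbgemm_silu_mul_quant, which raises the same fp8e4nv ValueError here.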
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1417009Z 2025-05-07T20:32:43.1417128Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.1417335Z 2025-05-07T20:32:43.1417446Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1417837Z self=, 2025-05-07T20:32:43.1418240Z T=128, 2025-05-07T20:32:43.1418417Z D=5120, 2025-05-07T20:32:43.1418593Z scale_ub=1200.0, 2025-05-07T20:32:43.1418808Z contiguous=True, 2025-05-07T20:32:43.1419018Z compiled=True, 2025-05-07T20:32:43.1419209Z ) 2025-05-07T20:32:43.1419510Z self = 2025-05-07T20:32:43.1419991Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.1420257Z 2025-05-07T20:32:43.1420333Z @given( 2025-05-07T20:32:43.1420668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1420974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1421273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1421590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1421904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1422176Z ) 2025-05-07T20:32:43.1422505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1422946Z def test_silu_mul_quant( 2025-05-07T20:32:43.1423175Z self, 2025-05-07T20:32:43.1423354Z T: int, 2025-05-07T20:32:43.1423544Z D: int, 2025-05-07T20:32:43.1423753Z scale_ub: Optional[float], 2025-05-07T20:32:43.1424012Z contiguous: bool, 2025-05-07T20:32:43.1424257Z compiled: bool, 2025-05-07T20:32:43.1424472Z ) -> None: 2025-05-07T20:32:43.1424686Z torch.manual_seed(2025) 2025-05-07T20:32:43.1424916Z 2025-05-07T20:32:43.1425183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1425519Z 2025-05-07T20:32:43.1425697Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.1427653Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1429555Z 2025-05-07T20:32:43.1429755Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.1429957Z 2025-05-07T20:32:43.1430060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1430464Z self=, 2025-05-07T20:32:43.1430867Z T=128, 2025-05-07T20:32:43.1431041Z D=7168, 2025-05-07T20:32:43.1431224Z scale_ub=None, 2025-05-07T20:32:43.1431423Z contiguous=True, 2025-05-07T20:32:43.1431636Z compiled=True, 2025-05-07T20:32:43.1431832Z ) 2025-05-07T20:32:43.1432135Z self = 2025-05-07T20:32:43.1432605Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1432857Z 2025-05-07T20:32:43.1432939Z @given( 2025-05-07T20:32:43.1433149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1433449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1433740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1434053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1434366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1434647Z ) 2025-05-07T20:32:43.1434982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1435403Z def test_silu_mul_quant( 2025-05-07T20:32:43.1435636Z self, 2025-05-07T20:32:43.1435817Z T: int, 2025-05-07T20:32:43.1436002Z D: int, 2025-05-07T20:32:43.1436208Z scale_ub: Optional[float], 2025-05-07T20:32:43.1436472Z contiguous: bool, 2025-05-07T20:32:43.1436696Z compiled: bool, 2025-05-07T20:32:43.1436906Z ) -> None: 2025-05-07T20:32:43.1437114Z torch.manual_seed(2025) 2025-05-07T20:32:43.1437342Z 2025-05-07T20:32:43.1437598Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1439675Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1441752Z 2025-05-07T20:32:43.1441865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.1442069Z 2025-05-07T20:32:43.1493520Z FAILED 2025-05-07T20:32:43.1493677Z 2025-05-07T20:32:43.1493862Z =================================== FAILURES =================================== 2025-05-07T20:32:43.1494447Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:43.1495057Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:43.1495777Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:43.1496311Z | yield 2025-05-07T20:32:43.1496921Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:43.1497655Z | self._callTestMethod(testMethod) 2025-05-07T20:32:43.1498037Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:43.1498777Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:43.1499561Z | if method() is not None: 2025-05-07T20:32:43.1499894Z | ~~~~~~^^ 2025-05-07T20:32:43.1500788Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:43.1501757Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1502151Z | ^^^^^^^ 2025-05-07T20:32:43.1503189Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:43.1504069Z | raise the_error_hypothesis_found 2025-05-07T20:32:43.1504637Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:43.1505206Z +-+---------------- 1 ---------------- 2025-05-07T20:32:43.1505594Z | Traceback (most recent call last): 2025-05-07T20:32:43.1506551Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1507769Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1510569Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |        ~~~~~~^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |       ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |            module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
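Sub-exceptions 1 through 3 are allocator failures; sub-exception 4 is an architecture gap: Triton rejects fp8e4nv on this runner's GPU and lists only fp8e4b15 and fp8e5 as supported, so the kernel can never compile here. A hedged sketch of a capability guard follows; it assumes fp8e4nv requires compute capability 8.9 or newer (Ada/Hopper), consistent with the g5 runner's GPU being rejected, and it is illustrative rather than FBGEMM's actual skip logic:

    # Sketch, not FBGEMM's actual skip logic: gate fp8e4nv tests on device
    # capability. Assumption: Triton's fp8e4nv needs compute capability >= 8.9
    # (Ada/Hopper); the GPU in this job is below that, matching the error above.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class Fp8KernelTests(unittest.TestCase):
        def test_device_supports_fp8e4nv(self) -> None:
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))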
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea25d4e0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
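The error text itself names the fallback: on this architecture Triton accepts only fp8e4b15 and fp8e5. A sketch of picking a Triton-compatible fp8 storage dtype per device; it assumes torch.float8_e4m3fn lowers to Triton's fp8e4nv and torch.float8_e5m2 to fp8e5, and is illustrative rather than the signature of triton_quantize_fp8_row:

    # Sketch (illustrative, not fbgemm's API): choose an fp8 storage dtype the
    # local Triton backend can compile. Assumption: torch.float8_e4m3fn lowers
    # to Triton's fp8e4nv, torch.float8_e5m2 to fp8e5, which the error above
    # lists as supported on this architecture.
    import torch


    def pick_fp8_dtype() -> torch.dtype:
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return torch.float8_e4m3fn  # fp8e4nv: higher precision, Ada/Hopper+
        return torch.float8_e5m2  # fp8e5: wider range, accepted on older parts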
2025-05-07T20:32:43.1655709Z op = torch.compile(op) 2025-05-07T20:32:43.1656078Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1656582Z 2025-05-07T20:32:43.1656849Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1657061Z 2025-05-07T20:32:43.1657182Z moe/activation_test.py:117: 2025-05-07T20:32:43.1657528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1657923Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1658285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1659198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1660144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1660871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1661790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1662722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1663404Z kernel = self.compile( 2025-05-07T20:32:43.1664150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1665000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1665469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1665750Z 2025-05-07T20:32:43.1665995Z self = 2025-05-07T20:32:43.1667572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1669581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea28e160>} 2025-05-07T20:32:43.1671379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1672689Z context = 2025-05-07T20:32:43.1673038Z 2025-05-07T20:32:43.1673252Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1673931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1674502Z module_map=module_map) 2025-05-07T20:32:43.1674947Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1675368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1675701Z E ^ 2025-05-07T20:32:43.1676262Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1676830Z 2025-05-07T20:32:43.1677349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1678034Z 2025-05-07T20:32:43.1678183Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1678744Z self=, 2025-05-07T20:32:43.1679297Z T=2048, 2025-05-07T20:32:43.1679546Z D=5120, 2025-05-07T20:32:43.1679803Z scale_ub=1200.0, 2025-05-07T20:32:43.1680098Z contiguous=True, 2025-05-07T20:32:43.1680398Z compiled=True, 2025-05-07T20:32:43.1680672Z ) 2025-05-07T20:32:43.1681100Z self = 2025-05-07T20:32:43.1681762Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.1682132Z 2025-05-07T20:32:43.1682242Z @given( 2025-05-07T20:32:43.1682541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1683075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1683503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1683948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1684396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1684783Z ) 2025-05-07T20:32:43.1685259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1685857Z def test_silu_mul_quant( 2025-05-07T20:32:43.1686188Z self, 2025-05-07T20:32:43.1686450Z T: int, 2025-05-07T20:32:43.1686711Z D: int, 2025-05-07T20:32:43.1687005Z scale_ub: Optional[float], 2025-05-07T20:32:43.1687363Z contiguous: bool, 2025-05-07T20:32:43.1687679Z compiled: bool, 2025-05-07T20:32:43.1687982Z ) -> None: 2025-05-07T20:32:43.1688268Z torch.manual_seed(2025) 2025-05-07T20:32:43.1688590Z 2025-05-07T20:32:43.1688966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1689434Z 2025-05-07T20:32:43.1689686Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1690081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1690507Z x = x_sign * x_clamp 2025-05-07T20:32:43.1690827Z x0 = x[:, :D] 2025-05-07T20:32:43.1691119Z x1 = x[:, D:] 2025-05-07T20:32:43.1691402Z 2025-05-07T20:32:43.1691636Z if contiguous: 2025-05-07T20:32:43.1691948Z x0 = x0.contiguous() 2025-05-07T20:32:43.1692276Z x1 = x1.contiguous() 2025-05-07T20:32:43.1692590Z 2025-05-07T20:32:43.1692820Z if scale_ub is not None: 2025-05-07T20:32:43.1693178Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1693595Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1694106Z ) 2025-05-07T20:32:43.1694375Z else: 2025-05-07T20:32:43.1694668Z scale_ub_tensor = None 2025-05-07T20:32:43.1695002Z 2025-05-07T20:32:43.1695320Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1695735Z op = silu_mul_quant 2025-05-07T20:32:43.1696073Z if compiled: 2025-05-07T20:32:43.1696425Z op = torch.compile(op) 2025-05-07T20:32:43.1696834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1697206Z 2025-05-07T20:32:43.1697468Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1697863Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1698260Z 2025-05-07T20:32:43.1698575Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1699041Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1699448Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1699876Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1700380Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1700816Z 2025-05-07T20:32:43.1701086Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1701362Z 2025-05-07T20:32:43.1701495Z moe/activation_test.py:126: 2025-05-07T20:32:43.1701908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1702363Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1702813Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1703900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1704927Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1705686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1706617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1707816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1708824Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1709823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1710731Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1711580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1712302Z fn() 2025-05-07T20:32:43.1712982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1713781Z self.fn.run( 2025-05-07T20:32:43.1714414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1715134Z kernel = self.compile( 2025-05-07T20:32:43.1715897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1716797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1717329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1717650Z 2025-05-07T20:32:43.1717926Z self = 2025-05-07T20:32:43.1719407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1721319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea1be660>} 2025-05-07T20:32:43.1723263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1724606Z context = 2025-05-07T20:32:43.1724994Z 2025-05-07T20:32:43.1725223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1725862Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1741654Z module_map=module_map) 2025-05-07T20:32:43.1742166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1742629Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1742979Z E ^ 2025-05-07T20:32:43.1743627Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1744303Z 2025-05-07T20:32:43.1744894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1745597Z 2025-05-07T20:32:43.1745738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1746283Z self=, 2025-05-07T20:32:43.1746819Z T=16384, 2025-05-07T20:32:43.1747082Z D=7168, 2025-05-07T20:32:43.1747332Z scale_ub=1200.0, 2025-05-07T20:32:43.1747746Z contiguous=False, 2025-05-07T20:32:43.1748043Z compiled=False, 2025-05-07T20:32:43.1748308Z ) 2025-05-07T20:32:43.1748710Z self = 2025-05-07T20:32:43.1749365Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.1749729Z 2025-05-07T20:32:43.1749839Z @given( 2025-05-07T20:32:43.1750136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1750533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1750925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1751689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1752145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1752523Z ) 2025-05-07T20:32:43.1752947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1753511Z def test_silu_mul_quant( 2025-05-07T20:32:43.1753798Z self, 2025-05-07T20:32:43.1754030Z T: int, 2025-05-07T20:32:43.1754257Z D: int, 2025-05-07T20:32:43.1754517Z scale_ub: Optional[float], 2025-05-07T20:32:43.1754839Z contiguous: bool, 2025-05-07T20:32:43.1755115Z compiled: bool, 2025-05-07T20:32:43.1755381Z ) -> None: 2025-05-07T20:32:43.1755645Z torch.manual_seed(2025) 2025-05-07T20:32:43.1755931Z 2025-05-07T20:32:43.1756271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1756718Z 2025-05-07T20:32:43.1756945Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1757328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1757749Z x = x_sign * x_clamp 2025-05-07T20:32:43.1758062Z x0 = x[:, :D] 2025-05-07T20:32:43.1758346Z x1 = x[:, D:] 2025-05-07T20:32:43.1758627Z 2025-05-07T20:32:43.1758859Z if contiguous: 2025-05-07T20:32:43.1759149Z x0 = x0.contiguous() 2025-05-07T20:32:43.1759483Z x1 = x1.contiguous() 2025-05-07T20:32:43.1759781Z 2025-05-07T20:32:43.1760026Z if scale_ub is not None: 2025-05-07T20:32:43.1760381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1760808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1761191Z ) 2025-05-07T20:32:43.1761437Z else: 2025-05-07T20:32:43.1761705Z scale_ub_tensor = None 2025-05-07T20:32:43.1762186Z 2025-05-07T20:32:43.1762480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1762881Z op = silu_mul_quant 2025-05-07T20:32:43.1763198Z if compiled: 2025-05-07T20:32:43.1763512Z op = torch.compile(op) 2025-05-07T20:32:43.1763889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1764235Z 2025-05-07T20:32:43.1764479Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1764690Z 2025-05-07T20:32:43.1764824Z moe/activation_test.py:117: 2025-05-07T20:32:43.1765192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1765623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1765979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1766881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.1767778Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1768507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1769435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1770333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1771058Z kernel = self.compile( 2025-05-07T20:32:43.1771803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1772688Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1773208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1773526Z 2025-05-07T20:32:43.1773798Z self = 2025-05-07T20:32:43.1775415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1777404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a1080>} 2025-05-07T20:32:43.1779214Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1780599Z context = 2025-05-07T20:32:43.1780985Z 2025-05-07T20:32:43.1781196Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1781887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1782512Z module_map=module_map) 2025-05-07T20:32:43.1782995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1783473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1783813Z E ^ 2025-05-07T20:32:43.1784428Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1785043Z 2025-05-07T20:32:43.1785830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1786546Z 2025-05-07T20:32:43.1786695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1787281Z self=, 2025-05-07T20:32:43.1787904Z T=1, 2025-05-07T20:32:43.1788151Z D=7168, 2025-05-07T20:32:43.1788410Z scale_ub=None, 2025-05-07T20:32:43.1788694Z contiguous=True, 2025-05-07T20:32:43.1788997Z compiled=True, 2025-05-07T20:32:43.1789384Z ) 2025-05-07T20:32:43.1789809Z self = 2025-05-07T20:32:43.1790474Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1790814Z 2025-05-07T20:32:43.1790920Z @given( 2025-05-07T20:32:43.1791210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1791638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1792045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1792452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1792884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1793274Z ) 2025-05-07T20:32:43.1793753Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1794351Z def test_silu_mul_quant( 2025-05-07T20:32:43.1794683Z self, 2025-05-07T20:32:43.1794945Z T: int, 2025-05-07T20:32:43.1795200Z D: int, 2025-05-07T20:32:43.1795498Z scale_ub: Optional[float], 2025-05-07T20:32:43.1795866Z contiguous: bool, 2025-05-07T20:32:43.1796182Z compiled: bool, 2025-05-07T20:32:43.1796489Z ) -> None: 2025-05-07T20:32:43.1796781Z torch.manual_seed(2025) 2025-05-07T20:32:43.1797114Z 2025-05-07T20:32:43.1797519Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1797985Z 2025-05-07T20:32:43.1798236Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1798641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1799068Z x = x_sign * x_clamp 2025-05-07T20:32:43.1799397Z x0 = x[:, :D] 2025-05-07T20:32:43.1799685Z x1 = x[:, D:] 2025-05-07T20:32:43.1799965Z 2025-05-07T20:32:43.1800223Z if contiguous: 2025-05-07T20:32:43.1800529Z x0 = x0.contiguous() 2025-05-07T20:32:43.1800889Z x1 = x1.contiguous() 2025-05-07T20:32:43.1801220Z 2025-05-07T20:32:43.1801473Z if scale_ub is not None: 2025-05-07T20:32:43.1801855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1802309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1802826Z ) 2025-05-07T20:32:43.1803057Z else: 2025-05-07T20:32:43.1803312Z scale_ub_tensor = None 2025-05-07T20:32:43.1803646Z 2025-05-07T20:32:43.1803961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1804392Z op = silu_mul_quant 2025-05-07T20:32:43.1804733Z if compiled: 2025-05-07T20:32:43.1805072Z op = torch.compile(op) 2025-05-07T20:32:43.1805475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1805848Z 2025-05-07T20:32:43.1806103Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1806488Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1806893Z 2025-05-07T20:32:43.1807207Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1807666Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1808059Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1808482Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1808958Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1809341Z 2025-05-07T20:32:43.1809588Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1809843Z 2025-05-07T20:32:43.1809965Z moe/activation_test.py:126: 2025-05-07T20:32:43.1810338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1810777Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1811195Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1812267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1813292Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1814137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1815073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1816018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1817004Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1817989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1818822Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1819557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1820188Z fn() 2025-05-07T20:32:43.1820800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1821595Z self.fn.run( 2025-05-07T20:32:43.1822262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1823001Z kernel = self.compile( 2025-05-07T20:32:43.1823781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1824703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1825248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1825562Z 2025-05-07T20:32:43.1825834Z self = 2025-05-07T20:32:43.1826980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1828617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a3740>} 2025-05-07T20:32:43.1829960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1830960Z context = 2025-05-07T20:32:43.1831255Z 2025-05-07T20:32:43.1831421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1831938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1832407Z module_map=module_map) 2025-05-07T20:32:43.1832762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1833112Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1833373Z E ^ 2025-05-07T20:32:43.1833830Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1834276Z 2025-05-07T20:32:43.1834690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1835201Z 2025-05-07T20:32:43.1835300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1835701Z self=, 2025-05-07T20:32:43.1836102Z T=4096, 2025-05-07T20:32:43.1836287Z D=5120, 2025-05-07T20:32:43.1836466Z scale_ub=None, 2025-05-07T20:32:43.1836671Z contiguous=False, 2025-05-07T20:32:43.1836904Z compiled=False, 2025-05-07T20:32:43.1837111Z ) 2025-05-07T20:32:43.1837428Z self = 2025-05-07T20:32:43.1837912Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.1838282Z 2025-05-07T20:32:43.1838356Z @given( 2025-05-07T20:32:43.1838597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1838898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1839200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1839530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1839843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1840477Z ) 2025-05-07T20:32:43.1840836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1841448Z def test_silu_mul_quant( 2025-05-07T20:32:43.1841735Z self, 2025-05-07T20:32:43.1842194Z T: int, 2025-05-07T20:32:43.1842475Z D: int, 2025-05-07T20:32:43.1842769Z scale_ub: Optional[float], 2025-05-07T20:32:43.1843184Z contiguous: bool, 2025-05-07T20:32:43.1843482Z compiled: bool, 2025-05-07T20:32:43.1843792Z ) -> None: 2025-05-07T20:32:43.1844141Z torch.manual_seed(2025) 2025-05-07T20:32:43.1845068Z 2025-05-07T20:32:43.1845430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1845917Z 2025-05-07T20:32:43.1846177Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1846553Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1847018Z x = x_sign * x_clamp 2025-05-07T20:32:43.1847391Z x0 = x[:, :D] 2025-05-07T20:32:43.1847674Z x1 = x[:, D:] 2025-05-07T20:32:43.1848023Z 2025-05-07T20:32:43.1848306Z if contiguous: 2025-05-07T20:32:43.1848607Z x0 = x0.contiguous() 2025-05-07T20:32:43.1848983Z x1 = x1.contiguous() 2025-05-07T20:32:43.1849376Z 2025-05-07T20:32:43.1849635Z if scale_ub is not None: 2025-05-07T20:32:43.1850025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1850457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1850832Z ) 2025-05-07T20:32:43.1851147Z else: 2025-05-07T20:32:43.1851459Z scale_ub_tensor = None 2025-05-07T20:32:43.1852009Z 2025-05-07T20:32:43.1852333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1852763Z op = silu_mul_quant 2025-05-07T20:32:43.1853100Z if compiled: 2025-05-07T20:32:43.1853426Z op = torch.compile(op) 2025-05-07T20:32:43.1853914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1854268Z 2025-05-07T20:32:43.1854546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1854810Z 2025-05-07T20:32:43.1854933Z moe/activation_test.py:117: 2025-05-07T20:32:43.1855313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1855821Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1856154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1856920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1857817Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1858400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1859168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1862194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1862805Z kernel = self.compile( 2025-05-07T20:32:43.1863399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1864206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1864688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1865116Z 2025-05-07T20:32:43.1865417Z self = 2025-05-07T20:32:43.1866869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1868492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527100>} 2025-05-07T20:32:43.1869947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1871107Z context = 2025-05-07T20:32:43.1871415Z 2025-05-07T20:32:43.1871666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1872306Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1872964Z module_map=module_map) 2025-05-07T20:32:43.1873442Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1873915Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1874233Z E ^ 2025-05-07T20:32:43.1874796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1875264Z 2025-05-07T20:32:43.1875774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1876320Z 2025-05-07T20:32:43.1876509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1876965Z self=, 2025-05-07T20:32:43.1877538Z T=4096, 2025-05-07T20:32:43.1877849Z D=7168, 2025-05-07T20:32:43.1878089Z scale_ub=None, 2025-05-07T20:32:43.1878500Z contiguous=False, 2025-05-07T20:32:43.1878840Z compiled=False, 2025-05-07T20:32:43.1879182Z ) 2025-05-07T20:32:43.1879619Z self = 2025-05-07T20:32:43.1880226Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.1880518Z 2025-05-07T20:32:43.1880654Z @given( 2025-05-07T20:32:43.1880962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1881390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1881777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1882210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1882640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1883080Z ) 2025-05-07T20:32:43.1883533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1884091Z def test_silu_mul_quant( 2025-05-07T20:32:43.1884412Z self, 2025-05-07T20:32:43.1884783Z T: int, 2025-05-07T20:32:43.1885026Z D: int, 2025-05-07T20:32:43.1885352Z scale_ub: Optional[float], 2025-05-07T20:32:43.1885776Z contiguous: bool, 2025-05-07T20:32:43.1886063Z compiled: bool, 2025-05-07T20:32:43.1886388Z ) -> None: 2025-05-07T20:32:43.1886734Z torch.manual_seed(2025) 2025-05-07T20:32:43.1887018Z 2025-05-07T20:32:43.1887470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1887941Z 2025-05-07T20:32:43.1888180Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1888579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1889018Z x = x_sign * x_clamp 2025-05-07T20:32:43.1889356Z x0 = x[:, :D] 2025-05-07T20:32:43.1889618Z x1 = x[:, D:] 2025-05-07T20:32:43.1889959Z 2025-05-07T20:32:43.1890249Z if contiguous: 2025-05-07T20:32:43.1890613Z x0 = x0.contiguous() 2025-05-07T20:32:43.1891001Z x1 = x1.contiguous() 2025-05-07T20:32:43.1891406Z 2025-05-07T20:32:43.1891650Z if scale_ub is not None: 2025-05-07T20:32:43.1892055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1892495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1892846Z ) 2025-05-07T20:32:43.1893194Z else: 2025-05-07T20:32:43.1893487Z scale_ub_tensor = None 2025-05-07T20:32:43.1893801Z 2025-05-07T20:32:43.1894164Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1894555Z op = silu_mul_quant 2025-05-07T20:32:43.1894873Z if compiled: 2025-05-07T20:32:43.1895259Z op = torch.compile(op) 2025-05-07T20:32:43.1895678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1895776Z 2025-05-07T20:32:43.1895913Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1895925Z 2025-05-07T20:32:43.1896125Z moe/activation_test.py:117: 2025-05-07T20:32:43.1896328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1896461Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1896585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1897164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1897272Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1897784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1898082Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1898444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1898601Z kernel = self.compile( 2025-05-07T20:32:43.1899014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1899309Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1899569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1899574Z 2025-05-07T20:32:43.1899806Z self = 2025-05-07T20:32:43.1900637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1901215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0526f20>} 2025-05-07T20:32:43.1901999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1902278Z context = 2025-05-07T20:32:43.1902283Z 2025-05-07T20:32:43.1902529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1902856Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1902992Z module_map=module_map) 2025-05-07T20:32:43.1903204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1903349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1903529Z E ^ 2025-05-07T20:32:43.1903962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1903968Z 2025-05-07T20:32:43.1904415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1904523Z 2025-05-07T20:32:43.1904653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1904949Z self=, 2025-05-07T20:32:43.1905041Z T=128, 2025-05-07T20:32:43.1905248Z D=7168, 2025-05-07T20:32:43.1905358Z scale_ub=None, 2025-05-07T20:32:43.1905495Z contiguous=False, 2025-05-07T20:32:43.1905640Z compiled=True, 2025-05-07T20:32:43.1905738Z ) 2025-05-07T20:32:43.1905976Z self = 2025-05-07T20:32:43.1906277Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.1906282Z 2025-05-07T20:32:43.1906414Z @given( 2025-05-07T20:32:43.1906596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1906721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1906863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1907118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1907272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1907400Z ) 2025-05-07T20:32:43.1907760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1907883Z def test_silu_mul_quant( 2025-05-07T20:32:43.1907988Z self, 2025-05-07T20:32:43.1908161Z T: int, 2025-05-07T20:32:43.1908304Z D: int, 2025-05-07T20:32:43.1908462Z scale_ub: Optional[float], 2025-05-07T20:32:43.1908577Z contiguous: bool, 2025-05-07T20:32:43.1908689Z compiled: bool, 2025-05-07T20:32:43.1908815Z ) -> None: 2025-05-07T20:32:43.1908983Z torch.manual_seed(2025) 2025-05-07T20:32:43.1909123Z 2025-05-07T20:32:43.1909352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1909453Z 2025-05-07T20:32:43.1909573Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1909750Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1909913Z x = x_sign * x_clamp 2025-05-07T20:32:43.1910175Z x0 = x[:, :D] 2025-05-07T20:32:43.1910287Z x1 = x[:, D:] 2025-05-07T20:32:43.1910387Z 2025-05-07T20:32:43.1910565Z if contiguous: 2025-05-07T20:32:43.1910695Z x0 = x0.contiguous() 2025-05-07T20:32:43.1910863Z x1 = x1.contiguous() 2025-05-07T20:32:43.1911010Z 2025-05-07T20:32:43.1911127Z if scale_ub is not None: 2025-05-07T20:32:43.1911258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1911593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1911680Z ) 2025-05-07T20:32:43.1911832Z else: 2025-05-07T20:32:43.1911997Z scale_ub_tensor = None 2025-05-07T20:32:43.1912092Z 2025-05-07T20:32:43.1912282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1912427Z op = silu_mul_quant 2025-05-07T20:32:43.1912528Z if compiled: 2025-05-07T20:32:43.1912752Z op = torch.compile(op) 2025-05-07T20:32:43.1912891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1912988Z 2025-05-07T20:32:43.1913165Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1913312Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1913393Z 2025-05-07T20:32:43.1913653Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1913780Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1913944Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1914190Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1914355Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1914534Z 2025-05-07T20:32:43.1914676Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1914681Z 2025-05-07T20:32:43.1914804Z moe/activation_test.py:126: 2025-05-07T20:32:43.1915099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1915235Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1915417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1916074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1916215Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1916661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1916908Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1917357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1917623Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1918099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1918392Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1918761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1918869Z fn() 2025-05-07T20:32:43.1919345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1919442Z self.fn.run( 2025-05-07T20:32:43.1919918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1920075Z kernel = self.compile( 2025-05-07T20:32:43.1920559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1920792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1920953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1920958Z 2025-05-07T20:32:43.1921340Z self = 2025-05-07T20:32:43.1922173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1922742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527e20>} 2025-05-07T20:32:43.1923537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1923759Z context = 2025-05-07T20:32:43.1923764Z 2025-05-07T20:32:43.1923982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1924336Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1924517Z module_map=module_map) 2025-05-07T20:32:43.1924705Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1924831Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1924966Z E ^ 2025-05-07T20:32:43.1925312Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:43.1943337Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    [Triton frames: jit.py:330 in <lambda> -> jit.py:623 in run -> compiler.py:273 in compile -> make_ir, with CUDAOptions(num_stages=3, ...)]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:43.1956027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
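Both entry points compile a Triton kernel that writes float8_e4m3fn output, so eager and torch.compile runs fail identically and Hypothesis just replays the same compile error for every example. Rather than failing, the test could skip on hardware without E4M3 support. A sketch of such a guard, reusing the device_supports_fp8e4nv() probe from above (the class name and placement are illustrative, not the test file's actual code):

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # Probe from the sketch above: Triton's fp8e4nv needs SM >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class Fp8GuardExample(unittest.TestCase):
        # On the real test_silu_mul_quant the same decorator would sit
        # outermost, above the @given/@settings stack.
        @unittest.skipUnless(
            device_supports_fp8e4nv(),
            "Triton fp8e4nv (E4M3) unsupported; only fp8e4b15/fp8e5 on this GPU",
        )
        def test_requires_fp8e4nv(self) -> None:
            self.assertTrue(device_supports_fp8e4nv())

On this runner the guard would report one skip reason instead of a wall of repeated CompilationError tracebacks.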
2025-05-07T20:32:43.1956136Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    [Triton frames as above: jit.py:330 in <lambda> -> jit.py:623 in run -> compiler.py:273 in compile -> make_ir, with CUDAOptions(num_stages=3, ...)]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:43.1968391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
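For context on what the reference path computes: triton_quantize_fp8_row returns a rowwise fp8 tensor plus a per-row dequantization scale, which is why the test reconstructs values as y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode stand-in (naming ours; behavior inferred from the test's usage, with 448.0 = torch.finfo(torch.float8_e4m3fn).max, and scale_ub assumed to cap the row max before the scale is derived):

    import torch

    FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        row_max = row_max.clamp(min=1e-12)      # guard against all-zero rows
        gain = FP8_E4M3_MAX / row_max           # per-row quantization gain
        y_fp8 = (y.float() * gain[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
        return y_fp8.to(torch.float8_e4m3fn), 1.0 / gain

On SM 8.9+ the real Triton kernel fuses this into one pass; on this A10G the kernel cannot even be compiled because fp8e4nv is unavailable.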
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1967985Z 2025-05-07T20:32:43.1968391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1968396Z 2025-05-07T20:32:43.1968491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1968713Z self=, 2025-05-07T20:32:43.1968786Z T=1, 2025-05-07T20:32:43.1968860Z D=5120, 2025-05-07T20:32:43.1968946Z scale_ub=None, 2025-05-07T20:32:43.1969029Z contiguous=True, 2025-05-07T20:32:43.1969109Z compiled=True, 2025-05-07T20:32:43.1969184Z ) 2025-05-07T20:32:43.1969395Z self = 2025-05-07T20:32:43.1969559Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1969563Z 2025-05-07T20:32:43.1969637Z @given( 2025-05-07T20:32:43.1969752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1969849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1969960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1970069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1970180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1970252Z ) 2025-05-07T20:32:43.1970491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1970580Z def test_silu_mul_quant( 2025-05-07T20:32:43.1970652Z self, 2025-05-07T20:32:43.1970735Z T: int, 2025-05-07T20:32:43.1970813Z D: int, 2025-05-07T20:32:43.1970905Z scale_ub: Optional[float], 2025-05-07T20:32:43.1970993Z contiguous: bool, 2025-05-07T20:32:43.1971152Z compiled: bool, 2025-05-07T20:32:43.1971227Z ) -> None: 2025-05-07T20:32:43.1971321Z torch.manual_seed(2025) 2025-05-07T20:32:43.1971390Z 2025-05-07T20:32:43.1971550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1971624Z 2025-05-07T20:32:43.1971711Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1971834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1971920Z x = x_sign * x_clamp 2025-05-07T20:32:43.1971997Z x0 = x[:, :D] 2025-05-07T20:32:43.1972076Z x1 = x[:, D:] 2025-05-07T20:32:43.1972144Z 2025-05-07T20:32:43.1972223Z if contiguous: 2025-05-07T20:32:43.1972311Z x0 = x0.contiguous() 2025-05-07T20:32:43.1972394Z x1 = x1.contiguous() 2025-05-07T20:32:43.1972463Z 2025-05-07T20:32:43.1972558Z if scale_ub is not None: 2025-05-07T20:32:43.1972660Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1972796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1972875Z ) 2025-05-07T20:32:43.1972948Z else: 2025-05-07T20:32:43.1973039Z scale_ub_tensor = None 2025-05-07T20:32:43.1973113Z 2025-05-07T20:32:43.1973237Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1973326Z op = silu_mul_quant 2025-05-07T20:32:43.1973406Z if compiled: 2025-05-07T20:32:43.1973499Z op = torch.compile(op) 2025-05-07T20:32:43.1973603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1973671Z 2025-05-07T20:32:43.1973756Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1973880Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1973947Z 2025-05-07T20:32:43.1974076Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1974284Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1974378Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1974499Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1974638Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1974707Z 2025-05-07T20:32:43.1974805Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1974809Z 2025-05-07T20:32:43.1974903Z moe/activation_test.py:126: 2025-05-07T20:32:43.1975023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1975126Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1975253Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1975802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1975908Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1976262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1976486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1976848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1977101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1977521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1977686Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1978028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1978102Z fn() 2025-05-07T20:32:43.1978498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1978585Z self.fn.run( 2025-05-07T20:32:43.1978995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1979085Z kernel = self.compile( 2025-05-07T20:32:43.1979481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1979648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1979773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1979777Z 2025-05-07T20:32:43.1979972Z self = 2025-05-07T20:32:43.1980734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1981240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c739ac00>} 2025-05-07T20:32:43.1981968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1982156Z context = 2025-05-07T20:32:43.1982161Z 2025-05-07T20:32:43.1982317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1982575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1982680Z module_map=module_map) 2025-05-07T20:32:43.1982837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1983018Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1983088Z E ^ 2025-05-07T20:32:43.1983437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1983442Z 2025-05-07T20:32:43.1983856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1983860Z 2025-05-07T20:32:43.1983957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1984173Z self=, 2025-05-07T20:32:43.1984245Z T=2048, 2025-05-07T20:32:43.1984319Z D=5120, 2025-05-07T20:32:43.1984399Z scale_ub=None, 2025-05-07T20:32:43.1984479Z contiguous=True, 2025-05-07T20:32:43.1984554Z compiled=True, 2025-05-07T20:32:43.1984627Z ) 2025-05-07T20:32:43.1984837Z self = 2025-05-07T20:32:43.1985007Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1985012Z 2025-05-07T20:32:43.1985090Z @given( 2025-05-07T20:32:43.1985215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1985317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1985426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1985536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1985652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1985727Z ) 2025-05-07T20:32:43.1985963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1986054Z def test_silu_mul_quant( 2025-05-07T20:32:43.1986128Z self, 2025-05-07T20:32:43.1986199Z T: int, 2025-05-07T20:32:43.1986276Z D: int, 2025-05-07T20:32:43.1986367Z scale_ub: Optional[float], 2025-05-07T20:32:43.1986459Z contiguous: bool, 2025-05-07T20:32:43.1986543Z compiled: bool, 2025-05-07T20:32:43.1986613Z ) -> None: 2025-05-07T20:32:43.1986703Z torch.manual_seed(2025) 2025-05-07T20:32:43.1986772Z 2025-05-07T20:32:43.1987014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1987091Z 2025-05-07T20:32:43.1987178Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1987298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1987454Z x = x_sign * x_clamp 2025-05-07T20:32:43.1987538Z x0 = x[:, :D] 2025-05-07T20:32:43.1987632Z x1 = x[:, D:] 2025-05-07T20:32:43.1987702Z 2025-05-07T20:32:43.1987780Z if contiguous: 2025-05-07T20:32:43.1987866Z x0 = x0.contiguous() 2025-05-07T20:32:43.1987954Z x1 = x1.contiguous() 2025-05-07T20:32:43.1988022Z 2025-05-07T20:32:43.1988109Z if scale_ub is not None: 2025-05-07T20:32:43.1988210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1988343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1988415Z ) 2025-05-07T20:32:43.1988487Z else: 2025-05-07T20:32:43.1988580Z scale_ub_tensor = None 2025-05-07T20:32:43.1988649Z 2025-05-07T20:32:43.1988772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1988855Z op = silu_mul_quant 2025-05-07T20:32:43.1988938Z if compiled: 2025-05-07T20:32:43.1989032Z op = torch.compile(op) 2025-05-07T20:32:43.1989131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1989203Z 2025-05-07T20:32:43.1989290Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1989411Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1989480Z 2025-05-07T20:32:43.1989609Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1989710Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1989803Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1990002Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1990147Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1990217Z 2025-05-07T20:32:43.1990313Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1990318Z 2025-05-07T20:32:43.1990414Z moe/activation_test.py:126: 2025-05-07T20:32:43.1990535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1990639Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1990768Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1991317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1991420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1991774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1991998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1992372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1992622Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1992993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1993153Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1993489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1993567Z fn() 2025-05-07T20:32:43.1993961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1994042Z self.fn.run( 2025-05-07T20:32:43.1994379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1994468Z kernel = self.compile( 2025-05-07T20:32:43.1994927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1995094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1995218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1995223Z 2025-05-07T20:32:43.1995425Z self = 2025-05-07T20:32:43.1996186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1996681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c77ca020>} 2025-05-07T20:32:43.1997424Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1997611Z context = 2025-05-07T20:32:43.1997615Z 2025-05-07T20:32:43.1997772Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1998028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1998135Z module_map=module_map) 2025-05-07T20:32:43.1998291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1998390Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1998467Z E ^ 2025-05-07T20:32:43.1998813Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1998897Z 2025-05-07T20:32:43.1999337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1999341Z 2025-05-07T20:32:43.1999438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1999652Z self=, 2025-05-07T20:32:43.1999730Z T=128, 2025-05-07T20:32:43.1999808Z D=5120, 2025-05-07T20:32:43.1999888Z scale_ub=None, 2025-05-07T20:32:43.1999974Z contiguous=True, 2025-05-07T20:32:43.2000049Z compiled=True, 2025-05-07T20:32:43.2000121Z ) 2025-05-07T20:32:43.2000333Z self = 2025-05-07T20:32:43.2000495Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2000500Z 2025-05-07T20:32:43.2000579Z @given( 2025-05-07T20:32:43.2000697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2000791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2000909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2001020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2001130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2001204Z ) 2025-05-07T20:32:43.2001445Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2001542Z def test_silu_mul_quant( 2025-05-07T20:32:43.2001619Z self, 2025-05-07T20:32:43.2001696Z T: int, 2025-05-07T20:32:43.2001781Z D: int, 2025-05-07T20:32:43.2001876Z scale_ub: Optional[float], 2025-05-07T20:32:43.2001961Z contiguous: bool, 2025-05-07T20:32:43.2002050Z compiled: bool, 2025-05-07T20:32:43.2002128Z ) -> None: 2025-05-07T20:32:43.2002221Z torch.manual_seed(2025) 2025-05-07T20:32:43.2002303Z 2025-05-07T20:32:43.2002469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2002538Z 2025-05-07T20:32:43.2002717Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2002838Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2002930Z x = x_sign * x_clamp 2025-05-07T20:32:43.2003008Z x0 = x[:, :D] 2025-05-07T20:32:43.2003084Z x1 = x[:, D:] 2025-05-07T20:32:43.2003162Z 2025-05-07T20:32:43.2003244Z if contiguous: 2025-05-07T20:32:43.2003334Z x0 = x0.contiguous() 2025-05-07T20:32:43.2003428Z x1 = x1.contiguous() 2025-05-07T20:32:43.2003500Z 2025-05-07T20:32:43.2003588Z if scale_ub is not None: 2025-05-07T20:32:43.2003698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2003833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2003906Z ) 2025-05-07T20:32:43.2003985Z else: 2025-05-07T20:32:43.2004079Z scale_ub_tensor = None 2025-05-07T20:32:43.2004161Z 2025-05-07T20:32:43.2004292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2004388Z op = silu_mul_quant 2025-05-07T20:32:43.2004475Z if compiled: 2025-05-07T20:32:43.2004575Z op = torch.compile(op) 2025-05-07T20:32:43.2004679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2004752Z 2025-05-07T20:32:43.2004840Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2004959Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2005033Z 2025-05-07T20:32:43.2005164Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2005262Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2005364Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2005481Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2005621Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2005799Z 2025-05-07T20:32:43.2005896Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2005900Z 2025-05-07T20:32:43.2006006Z moe/activation_test.py:126: 2025-05-07T20:32:43.2006131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2006233Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2006368Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2006919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2007022Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2007403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2007645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2008019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2008275Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2008653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2008814Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2009152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2009234Z fn() 2025-05-07T20:32:43.2009629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2009709Z self.fn.run( 2025-05-07T20:32:43.2010050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2010144Z kernel = self.compile( 2025-05-07T20:32:43.2010536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2010787Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2010911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2010916Z 2025-05-07T20:32:43.2011121Z self = 2025-05-07T20:32:43.2011882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2012378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c679b420>} 2025-05-07T20:32:43.2013112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2013301Z context = 2025-05-07T20:32:43.2013306Z 2025-05-07T20:32:43.2013470Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2013726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2013838Z module_map=module_map) 2025-05-07T20:32:43.2013995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2014092Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2014172Z E ^ 2025-05-07T20:32:43.2014519Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2014524Z 2025-05-07T20:32:43.2015021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2015026Z 2025-05-07T20:32:43.2015130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2015350Z self=, 2025-05-07T20:32:43.2015429Z T=4096, 2025-05-07T20:32:43.2015502Z D=5120, 2025-05-07T20:32:43.2015580Z scale_ub=None, 2025-05-07T20:32:43.2015675Z contiguous=True, 2025-05-07T20:32:43.2015759Z compiled=True, 2025-05-07T20:32:43.2015831Z ) 2025-05-07T20:32:43.2016053Z self = 2025-05-07T20:32:43.2016220Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2016225Z 2025-05-07T20:32:43.2016306Z @given( 2025-05-07T20:32:43.2016421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2016517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2016644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2016757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2016871Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2016950Z ) 2025-05-07T20:32:43.2017189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2017280Z def test_silu_mul_quant( 2025-05-07T20:32:43.2017366Z self, 2025-05-07T20:32:43.2017458Z T: int, 2025-05-07T20:32:43.2017543Z D: int, 2025-05-07T20:32:43.2017660Z scale_ub: Optional[float], 2025-05-07T20:32:43.2017749Z contiguous: bool, 2025-05-07T20:32:43.2017837Z compiled: bool, 2025-05-07T20:32:43.2017913Z ) -> None: 2025-05-07T20:32:43.2018004Z torch.manual_seed(2025) 2025-05-07T20:32:43.2018085Z 2025-05-07T20:32:43.2018248Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2018322Z 2025-05-07T20:32:43.2018423Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2018543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2018714Z x = x_sign * x_clamp 2025-05-07T20:32:43.2018802Z x0 = x[:, :D] 2025-05-07T20:32:43.2018878Z x1 = x[:, D:] 2025-05-07T20:32:43.2018949Z 2025-05-07T20:32:43.2019038Z if contiguous: 2025-05-07T20:32:43.2019127Z x0 = x0.contiguous() 2025-05-07T20:32:43.2019218Z x1 = x1.contiguous() 2025-05-07T20:32:43.2019288Z 2025-05-07T20:32:43.2019376Z if scale_ub is not None: 2025-05-07T20:32:43.2019486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2019617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2019692Z ) 2025-05-07T20:32:43.2019770Z else: 2025-05-07T20:32:43.2019864Z scale_ub_tensor = None 2025-05-07T20:32:43.2019939Z 2025-05-07T20:32:43.2020074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2020166Z op = silu_mul_quant 2025-05-07T20:32:43.2020249Z if compiled: 2025-05-07T20:32:43.2020359Z op = torch.compile(op) 2025-05-07T20:32:43.2020462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2020538Z 2025-05-07T20:32:43.2020629Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2020746Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2020822Z 2025-05-07T20:32:43.2020954Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2021052Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2021159Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2021275Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2021409Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2021489Z 2025-05-07T20:32:43.2021585Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2021671Z 2025-05-07T20:32:43.2021772Z moe/activation_test.py:126: 2025-05-07T20:32:43.2021896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2022007Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2022142Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2022692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2022789Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2023148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2023365Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2023735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2023985Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2024367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2024534Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2024872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2024951Z fn() 2025-05-07T20:32:43.2025348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2025430Z self.fn.run( 2025-05-07T20:32:43.2025769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2025861Z kernel = self.compile( 2025-05-07T20:32:43.2026237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2026417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2026618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2026623Z 2025-05-07T20:32:43.2026827Z self = 2025-05-07T20:32:43.2027682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2028174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c67eaac0>} 2025-05-07T20:32:43.2028911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2029103Z context = 2025-05-07T20:32:43.2029107Z 2025-05-07T20:32:43.2029276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2029533Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2029638Z module_map=module_map) 2025-05-07T20:32:43.2029800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2029899Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2029981Z E ^ 2025-05-07T20:32:43.2030330Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2030334Z 2025-05-07T20:32:43.2030765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2030849Z 2025-05-07T20:32:43.2030957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2031173Z self=, 2025-05-07T20:32:43.2031262Z T=16384, 2025-05-07T20:32:43.2031342Z D=5120, 2025-05-07T20:32:43.2031421Z scale_ub=None, 2025-05-07T20:32:43.2031508Z contiguous=True, 2025-05-07T20:32:43.2031589Z compiled=True, 2025-05-07T20:32:43.2031660Z ) 2025-05-07T20:32:43.2031877Z self = 2025-05-07T20:32:43.2032049Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2032053Z 2025-05-07T20:32:43.2032127Z @given( 2025-05-07T20:32:43.2032249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2032346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2032459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2032579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2032693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2032773Z ) 2025-05-07T20:32:43.2033017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2033107Z def test_silu_mul_quant( 2025-05-07T20:32:43.2033189Z self, 2025-05-07T20:32:43.2033266Z T: int, 2025-05-07T20:32:43.2033343Z D: int, 2025-05-07T20:32:43.2033442Z scale_ub: Optional[float], 2025-05-07T20:32:43.2033530Z contiguous: bool, 2025-05-07T20:32:43.2033613Z compiled: bool, 2025-05-07T20:32:43.2033694Z ) -> None: 2025-05-07T20:32:43.2033786Z torch.manual_seed(2025) 2025-05-07T20:32:43.2033855Z 2025-05-07T20:32:43.2034023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2034093Z 2025-05-07T20:32:43.2034187Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2034309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2034399Z x = x_sign * x_clamp 2025-05-07T20:32:43.2034483Z x0 = x[:, :D] 2025-05-07T20:32:43.2034561Z x1 = x[:, D:] 2025-05-07T20:32:43.2034633Z 2025-05-07T20:32:43.2034889Z if contiguous: 2025-05-07T20:32:43.2034981Z x0 = x0.contiguous() 2025-05-07T20:32:43.2035069Z x1 = x1.contiguous() 2025-05-07T20:32:43.2035146Z 2025-05-07T20:32:43.2035234Z if scale_ub is not None: 2025-05-07T20:32:43.2035334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2035478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2035555Z ) 2025-05-07T20:32:43.2035635Z else: 2025-05-07T20:32:43.2035726Z scale_ub_tensor = None 2025-05-07T20:32:43.2035799Z 2025-05-07T20:32:43.2035933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2036018Z op = silu_mul_quant 2025-05-07T20:32:43.2036099Z if compiled: 2025-05-07T20:32:43.2036205Z op = torch.compile(op) 2025-05-07T20:32:43.2036306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2036379Z 2025-05-07T20:32:43.2036478Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2036596Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2036668Z 2025-05-07T20:32:43.2036804Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2036902Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2037005Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2037120Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2037255Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2037331Z 2025-05-07T20:32:43.2037427Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2037432Z 2025-05-07T20:32:43.2037525Z moe/activation_test.py:126: 2025-05-07T20:32:43.2037653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2037838Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2037967Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2038527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2038624Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2038984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2039200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2039561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2039819Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2040426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2040639Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2040986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2041059Z fn() 2025-05-07T20:32:43.2041461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2041539Z self.fn.run( 2025-05-07T20:32:43.2041871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2041966Z kernel = self.compile( 2025-05-07T20:32:43.2042341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2042514Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2042637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2042645Z 2025-05-07T20:32:43.2042843Z self = 2025-05-07T20:32:43.2043747Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2044239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13f8b5a520>} 2025-05-07T20:32:43.2044976Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2045162Z context = 2025-05-07T20:32:43.2045170Z 2025-05-07T20:32:43.2045334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2045596Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2045698Z module_map=module_map) 2025-05-07T20:32:43.2045860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2045955Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2046028Z E ^ 2025-05-07T20:32:43.2046382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2046386Z 2025-05-07T20:32:43.2046795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2046799Z 2025-05-07T20:32:43.2046902Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2047119Z self=, 2025-05-07T20:32:43.2047318Z T=1, 2025-05-07T20:32:43.2047397Z D=5120, 2025-05-07T20:32:43.2047477Z scale_ub=1200.0, 2025-05-07T20:32:43.2047563Z contiguous=True, 2025-05-07T20:32:43.2047650Z compiled=True, 2025-05-07T20:32:43.2047716Z ) 2025-05-07T20:32:43.2047933Z self = 2025-05-07T20:32:43.2048100Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2048105Z 2025-05-07T20:32:43.2048180Z @given( 2025-05-07T20:32:43.2048305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2048400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2048510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2048632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2048743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2048816Z ) 2025-05-07T20:32:43.2049061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2049159Z def test_silu_mul_quant( 2025-05-07T20:32:43.2049239Z self, 2025-05-07T20:32:43.2049323Z T: int, 2025-05-07T20:32:43.2049395Z D: int, 2025-05-07T20:32:43.2049496Z scale_ub: Optional[float], 2025-05-07T20:32:43.2049585Z contiguous: bool, 2025-05-07T20:32:43.2049665Z compiled: bool, 2025-05-07T20:32:43.2049748Z ) -> None: 2025-05-07T20:32:43.2049840Z torch.manual_seed(2025) 2025-05-07T20:32:43.2049910Z 2025-05-07T20:32:43.2050079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2050153Z 2025-05-07T20:32:43.2050243Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2050371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2050460Z x = x_sign * x_clamp 2025-05-07T20:32:43.2050538Z x0 = x[:, :D] 2025-05-07T20:32:43.2050619Z x1 = x[:, D:] 2025-05-07T20:32:43.2050695Z 2025-05-07T20:32:43.2050782Z if contiguous: 2025-05-07T20:32:43.2050869Z x0 = x0.contiguous() 2025-05-07T20:32:43.2051035Z x1 = x1.contiguous() 2025-05-07T20:32:43.2051112Z 2025-05-07T20:32:43.2051197Z if scale_ub is not None: 2025-05-07T20:32:43.2051302Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2051437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2051510Z ) 2025-05-07T20:32:43.2051582Z else: 2025-05-07T20:32:43.2051674Z scale_ub_tensor = None 2025-05-07T20:32:43.2051743Z 2025-05-07T20:32:43.2051868Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2051959Z op = silu_mul_quant 2025-05-07T20:32:43.2052042Z if compiled: 2025-05-07T20:32:43.2052142Z op = torch.compile(op) 2025-05-07T20:32:43.2052246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2052312Z 2025-05-07T20:32:43.2052407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2052411Z 2025-05-07T20:32:43.2052503Z moe/activation_test.py:117: 2025-05-07T20:32:43.2052633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2052735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2052831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2053194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2053287Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2053771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2053870Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2054223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2054441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2054862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2054957Z kernel = self.compile( 2025-05-07T20:32:43.2055339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2055507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2055629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2055634Z 2025-05-07T20:32:43.2055836Z self = 2025-05-07T20:32:43.2056597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2057105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c5d0f1a0>} 2025-05-07T20:32:43.2057889Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2058073Z context = 2025-05-07T20:32:43.2058078Z 2025-05-07T20:32:43.2058242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2058498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2058606Z module_map=module_map) 2025-05-07T20:32:43.2058762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2058857Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2058943Z E ^ 2025-05-07T20:32:43.2059290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2059372Z 2025-05-07T20:32:43.2059780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2059790Z 2025-05-07T20:32:43.2059889Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2060104Z self=, 2025-05-07T20:32:43.2060183Z T=1, 2025-05-07T20:32:43.2060256Z D=5120, 2025-05-07T20:32:43.2060336Z scale_ub=None, 2025-05-07T20:32:43.2060427Z contiguous=False, 2025-05-07T20:32:43.2060505Z compiled=True, 2025-05-07T20:32:43.2060570Z ) 2025-05-07T20:32:43.2060792Z self = 2025-05-07T20:32:43.2060951Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2060959Z 2025-05-07T20:32:43.2061040Z @given( 2025-05-07T20:32:43.2061154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2061255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2061369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2061481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2061589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2065747Z ) 2025-05-07T20:32:43.2066011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2066106Z def test_silu_mul_quant( 2025-05-07T20:32:43.2066182Z self, 2025-05-07T20:32:43.2066257Z T: int, 2025-05-07T20:32:43.2066330Z D: int, 2025-05-07T20:32:43.2066428Z scale_ub: Optional[float], 2025-05-07T20:32:43.2066513Z contiguous: bool, 2025-05-07T20:32:43.2066596Z compiled: bool, 2025-05-07T20:32:43.2066671Z ) -> None: 2025-05-07T20:32:43.2066888Z torch.manual_seed(2025) 2025-05-07T20:32:43.2066961Z 2025-05-07T20:32:43.2067126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2067201Z 2025-05-07T20:32:43.2067296Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2067481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2067566Z x = x_sign * x_clamp 2025-05-07T20:32:43.2067647Z x0 = x[:, :D] 2025-05-07T20:32:43.2067719Z x1 = x[:, D:] 2025-05-07T20:32:43.2067786Z 2025-05-07T20:32:43.2067872Z if contiguous: 2025-05-07T20:32:43.2067957Z x0 = x0.contiguous() 2025-05-07T20:32:43.2068043Z x1 = x1.contiguous() 2025-05-07T20:32:43.2068118Z 2025-05-07T20:32:43.2068206Z if scale_ub is not None: 2025-05-07T20:32:43.2068311Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2068442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2068519Z ) 2025-05-07T20:32:43.2068591Z else: 2025-05-07T20:32:43.2068679Z scale_ub_tensor = None 2025-05-07T20:32:43.2068743Z 2025-05-07T20:32:43.2068878Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2068963Z op = silu_mul_quant 2025-05-07T20:32:43.2069044Z if compiled: 2025-05-07T20:32:43.2069142Z op = torch.compile(op) 2025-05-07T20:32:43.2069243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2069314Z 2025-05-07T20:32:43.2069406Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2069521Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2069592Z 2025-05-07T20:32:43.2069722Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2069820Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2069918Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2070031Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2070168Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2070243Z 2025-05-07T20:32:43.2070419Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2070425Z 2025-05-07T20:32:43.2070521Z moe/activation_test.py:126: 2025-05-07T20:32:43.2070655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2070755Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2070889Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2071442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2071536Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2071899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2072115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2072488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2072742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2073113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2073278Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2073614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2073688Z fn() 2025-05-07T20:32:43.2074087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2074166Z self.fn.run( 2025-05-07T20:32:43.2074502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2074672Z kernel = self.compile( 2025-05-07T20:32:43.2075053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2075225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2075348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2075353Z 2025-05-07T20:32:43.2075555Z self = 2025-05-07T20:32:43.2076315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2076808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c64782c0>} 2025-05-07T20:32:43.2077602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2077785Z context = 2025-05-07T20:32:43.2077790Z 2025-05-07T20:32:43.2077950Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2078203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2078306Z module_map=module_map) 2025-05-07T20:32:43.2078464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2078560Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2078633Z E ^ 2025-05-07T20:32:43.2078982Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis continued drawing examples, and every remaining draw failed in exactly the same way: identical test body, identical Triton frames, and the identical error

E       ValueError("type fp8e4nv not supported in this architecture. 
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Only the drawn parameters and the kernel that first hits the error vary:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (moe/activation.py:80, reached from fn() at moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678, since compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> here fn() itself returned, and the CompilationError surfaced in the reference path instead:
       ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row
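Note that in the last example the failure moved into the reference path: ref_fn() also lowers through Triton (_kernel_quantize_fp8_row), so on this GPU even the baseline cannot compile. Rowwise FP8 quantization of the kind the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None] can be emulated in plain PyTorch; a rough device-independent stand-in (my sketch under an assumed scale convention, not FBGEMM's actual API) looks like:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum, optionally clamped from above by scale_ub.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        # Scale each row into the E4M3 range; dequantization is fp8 * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This sidesteps Triton entirely, though on a pre-SM89 device the cast to torch.float8_e4m3fn is still a software conversion, so it can only replace the reference, not the kernel under test.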
The remaining draws failed in the fused kernel again:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)

moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2223874Z 2025-05-07T20:32:43.2224281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2224286Z 2025-05-07T20:32:43.2224388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2224604Z self=, 2025-05-07T20:32:43.2224676Z T=2048, 2025-05-07T20:32:43.2224753Z D=7168, 2025-05-07T20:32:43.2224835Z scale_ub=1200.0, 2025-05-07T20:32:43.2224916Z contiguous=False, 2025-05-07T20:32:43.2224997Z compiled=True, 2025-05-07T20:32:43.2225065Z ) 2025-05-07T20:32:43.2225275Z self = 2025-05-07T20:32:43.2225445Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2225452Z 2025-05-07T20:32:43.2225525Z @given( 2025-05-07T20:32:43.2225642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2225742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2225852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2225965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2226073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2226143Z ) 2025-05-07T20:32:43.2226382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2226468Z def test_silu_mul_quant( 2025-05-07T20:32:43.2226543Z self, 2025-05-07T20:32:43.2226614Z T: int, 2025-05-07T20:32:43.2226685Z D: int, 2025-05-07T20:32:43.2226779Z scale_ub: Optional[float], 2025-05-07T20:32:43.2226863Z contiguous: bool, 2025-05-07T20:32:43.2226942Z compiled: bool, 2025-05-07T20:32:43.2227100Z ) -> None: 2025-05-07T20:32:43.2227189Z torch.manual_seed(2025) 2025-05-07T20:32:43.2227259Z 2025-05-07T20:32:43.2227483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2227551Z 2025-05-07T20:32:43.2227638Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2227759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2227844Z x = x_sign * x_clamp 2025-05-07T20:32:43.2227916Z x0 = x[:, :D] 2025-05-07T20:32:43.2227995Z x1 = x[:, D:] 2025-05-07T20:32:43.2228063Z 2025-05-07T20:32:43.2228145Z if contiguous: 2025-05-07T20:32:43.2228230Z x0 = x0.contiguous() 2025-05-07T20:32:43.2228313Z x1 = x1.contiguous() 2025-05-07T20:32:43.2228385Z 2025-05-07T20:32:43.2228471Z if scale_ub is not None: 2025-05-07T20:32:43.2228567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2228698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2228775Z ) 2025-05-07T20:32:43.2228846Z else: 2025-05-07T20:32:43.2228937Z scale_ub_tensor = None 2025-05-07T20:32:43.2229007Z 2025-05-07T20:32:43.2229134Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2229218Z op = silu_mul_quant 2025-05-07T20:32:43.2229297Z if compiled: 2025-05-07T20:32:43.2229393Z op = torch.compile(op) 2025-05-07T20:32:43.2229490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2229558Z 2025-05-07T20:32:43.2229643Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2229647Z 2025-05-07T20:32:43.2229738Z moe/activation_test.py:117: 2025-05-07T20:32:43.2229861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2229958Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2230052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2230414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2230510Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2231073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2231170Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2231520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2231734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2232068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2232153Z kernel = self.compile( 2025-05-07T20:32:43.2232534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2232699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2232822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2232827Z 2025-05-07T20:32:43.2233035Z self = 2025-05-07T20:32:43.2233794Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2234290Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63428e0>} 2025-05-07T20:32:43.2235020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2235280Z context = 2025-05-07T20:32:43.2235284Z 2025-05-07T20:32:43.2235443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2235697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2235802Z module_map=module_map) 2025-05-07T20:32:43.2235958Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2236051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2236127Z E ^ 2025-05-07T20:32:43.2236470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2236474Z 2025-05-07T20:32:43.2236875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2236884Z 2025-05-07T20:32:43.2236981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2237200Z self=, 2025-05-07T20:32:43.2237276Z T=1, 2025-05-07T20:32:43.2237347Z D=5120, 2025-05-07T20:32:43.2237429Z scale_ub=None, 2025-05-07T20:32:43.2237511Z contiguous=False, 2025-05-07T20:32:43.2237591Z compiled=False, 2025-05-07T20:32:43.2237659Z ) 2025-05-07T20:32:43.2237870Z self = 2025-05-07T20:32:43.2238031Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2238036Z 2025-05-07T20:32:43.2238107Z @given( 2025-05-07T20:32:43.2238220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2238315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2238424Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2238533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2238639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2238720Z ) 2025-05-07T20:32:43.2238955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2239119Z def test_silu_mul_quant( 2025-05-07T20:32:43.2239196Z self, 2025-05-07T20:32:43.2239271Z T: int, 2025-05-07T20:32:43.2239340Z D: int, 2025-05-07T20:32:43.2239437Z scale_ub: Optional[float], 2025-05-07T20:32:43.2239518Z contiguous: bool, 2025-05-07T20:32:43.2239599Z compiled: bool, 2025-05-07T20:32:43.2239673Z ) -> None: 2025-05-07T20:32:43.2239759Z torch.manual_seed(2025) 2025-05-07T20:32:43.2239833Z 2025-05-07T20:32:43.2239991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2240249Z 2025-05-07T20:32:43.2240388Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2240562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2240690Z x = x_sign * x_clamp 2025-05-07T20:32:43.2240812Z x0 = x[:, :D] 2025-05-07T20:32:43.2240919Z x1 = x[:, D:] 2025-05-07T20:32:43.2241014Z 2025-05-07T20:32:43.2241133Z if contiguous: 2025-05-07T20:32:43.2241254Z x0 = x0.contiguous() 2025-05-07T20:32:43.2241343Z x1 = x1.contiguous() 2025-05-07T20:32:43.2241412Z 2025-05-07T20:32:43.2241497Z if scale_ub is not None: 2025-05-07T20:32:43.2241599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2241727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2241795Z ) 2025-05-07T20:32:43.2241870Z else: 2025-05-07T20:32:43.2241958Z scale_ub_tensor = None 2025-05-07T20:32:43.2242023Z 2025-05-07T20:32:43.2242151Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2242238Z op = silu_mul_quant 2025-05-07T20:32:43.2242317Z if compiled: 2025-05-07T20:32:43.2242419Z op = torch.compile(op) 2025-05-07T20:32:43.2242672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2242743Z 2025-05-07T20:32:43.2242828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2242833Z 2025-05-07T20:32:43.2242927Z moe/activation_test.py:117: 2025-05-07T20:32:43.2243051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2243144Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2243236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2243726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2243818Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2244172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2244387Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2244719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2244813Z kernel = self.compile( 2025-05-07T20:32:43.2245212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2245379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2245503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2245507Z 2025-05-07T20:32:43.2245703Z self = 2025-05-07T20:32:43.2246466Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2246953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63434c0>} 2025-05-07T20:32:43.2247830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2248015Z context = 2025-05-07T20:32:43.2248020Z 2025-05-07T20:32:43.2248178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2248433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2248534Z module_map=module_map) 2025-05-07T20:32:43.2248690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2248784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2248861Z E ^ 2025-05-07T20:32:43.2249215Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2249225Z 2025-05-07T20:32:43.2249659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2249664Z 2025-05-07T20:32:43.2249759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2249977Z self=, 2025-05-07T20:32:43.2250049Z T=4096, 2025-05-07T20:32:43.2250125Z D=7168, 2025-05-07T20:32:43.2250204Z scale_ub=1200.0, 2025-05-07T20:32:43.2250282Z contiguous=False, 2025-05-07T20:32:43.2250359Z compiled=False, 2025-05-07T20:32:43.2250426Z ) 2025-05-07T20:32:43.2250637Z self = 2025-05-07T20:32:43.2250808Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2250813Z 2025-05-07T20:32:43.2250885Z @given( 2025-05-07T20:32:43.2250996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2251171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2251285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2251397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2251502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2251572Z ) 2025-05-07T20:32:43.2251808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2251895Z def test_silu_mul_quant( 2025-05-07T20:32:43.2251965Z self, 2025-05-07T20:32:43.2252039Z T: int, 2025-05-07T20:32:43.2252108Z D: int, 2025-05-07T20:32:43.2252201Z scale_ub: Optional[float], 2025-05-07T20:32:43.2252286Z contiguous: bool, 2025-05-07T20:32:43.2252365Z compiled: bool, 2025-05-07T20:32:43.2252440Z ) -> None: 2025-05-07T20:32:43.2252530Z torch.manual_seed(2025) 2025-05-07T20:32:43.2252595Z 2025-05-07T20:32:43.2252762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2252833Z 2025-05-07T20:32:43.2252918Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2253043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2253125Z x = x_sign * x_clamp 2025-05-07T20:32:43.2253199Z x0 = x[:, :D] 2025-05-07T20:32:43.2253274Z x1 = x[:, D:] 2025-05-07T20:32:43.2253339Z 2025-05-07T20:32:43.2253416Z if contiguous: 2025-05-07T20:32:43.2253506Z x0 = x0.contiguous() 2025-05-07T20:32:43.2253590Z x1 = x1.contiguous() 2025-05-07T20:32:43.2253655Z 2025-05-07T20:32:43.2253744Z if scale_ub is not None: 2025-05-07T20:32:43.2253844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2253971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2254045Z ) 2025-05-07T20:32:43.2254116Z else: 2025-05-07T20:32:43.2254207Z scale_ub_tensor = None 2025-05-07T20:32:43.2254277Z 2025-05-07T20:32:43.2254397Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2254564Z op = silu_mul_quant 2025-05-07T20:32:43.2254645Z if compiled: 2025-05-07T20:32:43.2254738Z op = torch.compile(op) 2025-05-07T20:32:43.2254839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2254902Z 2025-05-07T20:32:43.2254985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2254989Z 2025-05-07T20:32:43.2255082Z moe/activation_test.py:117: 2025-05-07T20:32:43.2255203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2255301Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2255395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2255877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2255976Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2256326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2256547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2256884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2256974Z kernel = self.compile( 2025-05-07T20:32:43.2257376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2257542Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2257662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2257666Z 2025-05-07T20:32:43.2257865Z self = 2025-05-07T20:32:43.2258632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2259204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ad080>} 2025-05-07T20:32:43.2259934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2260116Z context = 2025-05-07T20:32:43.2260123Z 2025-05-07T20:32:43.2260278Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2260529Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2260638Z module_map=module_map) 2025-05-07T20:32:43.2260792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2260888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2260962Z E ^ 2025-05-07T20:32:43.2261306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2261310Z 2025-05-07T20:32:43.2261744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2261748Z 2025-05-07T20:32:43.2261846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2262059Z self=, 2025-05-07T20:32:43.2262137Z T=16384, 2025-05-07T20:32:43.2262210Z D=7168, 2025-05-07T20:32:43.2262289Z scale_ub=None, 2025-05-07T20:32:43.2262372Z contiguous=True, 2025-05-07T20:32:43.2262447Z compiled=True, 2025-05-07T20:32:43.2262518Z ) 2025-05-07T20:32:43.2262730Z self = 2025-05-07T20:32:43.2262970Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2262975Z 2025-05-07T20:32:43.2263051Z @given( 2025-05-07T20:32:43.2263165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2263262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2263379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2263492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2263601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2263674Z ) 2025-05-07T20:32:43.2263909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2264001Z def test_silu_mul_quant( 2025-05-07T20:32:43.2264075Z self, 2025-05-07T20:32:43.2264149Z T: int, 2025-05-07T20:32:43.2264229Z D: int, 2025-05-07T20:32:43.2264320Z scale_ub: Optional[float], 2025-05-07T20:32:43.2264404Z contiguous: bool, 2025-05-07T20:32:43.2264485Z compiled: bool, 2025-05-07T20:32:43.2264563Z ) -> None: 2025-05-07T20:32:43.2264649Z torch.manual_seed(2025) 2025-05-07T20:32:43.2264722Z 2025-05-07T20:32:43.2264881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2264952Z 2025-05-07T20:32:43.2265040Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2265157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2265242Z x = x_sign * x_clamp 2025-05-07T20:32:43.2265324Z x0 = x[:, :D] 2025-05-07T20:32:43.2265396Z x1 = x[:, D:] 2025-05-07T20:32:43.2265465Z 2025-05-07T20:32:43.2265541Z if contiguous: 2025-05-07T20:32:43.2265624Z x0 = x0.contiguous() 2025-05-07T20:32:43.2265707Z x1 = x1.contiguous() 2025-05-07T20:32:43.2265772Z 2025-05-07T20:32:43.2265936Z if scale_ub is not None: 2025-05-07T20:32:43.2266040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2266174Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2266242Z ) 2025-05-07T20:32:43.2266319Z else: 2025-05-07T20:32:43.2266408Z scale_ub_tensor = None 2025-05-07T20:32:43.2266475Z 2025-05-07T20:32:43.2266600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2266682Z op = silu_mul_quant 2025-05-07T20:32:43.2266764Z if compiled: 2025-05-07T20:32:43.2266855Z op = torch.compile(op) 2025-05-07T20:32:43.2266956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2267026Z 2025-05-07T20:32:43.2267109Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2267114Z 2025-05-07T20:32:43.2267205Z moe/activation_test.py:117: 2025-05-07T20:32:43.2267332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2267484Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2267577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2267945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2268033Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2268519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2268610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2268959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2269179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2269511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2269599Z kernel = self.compile( 2025-05-07T20:32:43.2269979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2270227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2270350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2270354Z 2025-05-07T20:32:43.2270552Z self = 2025-05-07T20:32:43.2271312Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2271802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ae2a0>} 2025-05-07T20:32:43.2272530Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2272726Z context = 2025-05-07T20:32:43.2272731Z 2025-05-07T20:32:43.2272886Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2273146Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2273247Z module_map=module_map) 2025-05-07T20:32:43.2273400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2273493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2273563Z E ^ 2025-05-07T20:32:43.2273907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2273912Z 2025-05-07T20:32:43.2274319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2274400Z 2025-05-07T20:32:43.2274503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2274726Z self=, 2025-05-07T20:32:43.2274799Z T=4096, 2025-05-07T20:32:43.2274867Z D=5120, 2025-05-07T20:32:43.2274946Z scale_ub=None, 2025-05-07T20:32:43.2275026Z contiguous=False, 2025-05-07T20:32:43.2275100Z compiled=True, 2025-05-07T20:32:43.2275170Z ) 2025-05-07T20:32:43.2275379Z self = 2025-05-07T20:32:43.2275543Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2275551Z 2025-05-07T20:32:43.2275621Z @given( 2025-05-07T20:32:43.2275731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2275828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2275939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2276051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2276169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2276240Z ) 2025-05-07T20:32:43.2276474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2276564Z def test_silu_mul_quant( 2025-05-07T20:32:43.2276633Z self, 2025-05-07T20:32:43.2276704Z T: int, 2025-05-07T20:32:43.2276778Z D: int, 2025-05-07T20:32:43.2276868Z scale_ub: Optional[float], 2025-05-07T20:32:43.2276954Z contiguous: bool, 2025-05-07T20:32:43.2277034Z compiled: bool, 2025-05-07T20:32:43.2277104Z ) -> None: 2025-05-07T20:32:43.2277194Z torch.manual_seed(2025) 2025-05-07T20:32:43.2277260Z 2025-05-07T20:32:43.2277430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2277519Z 2025-05-07T20:32:43.2277617Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2277752Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2277838Z x = x_sign * x_clamp 2025-05-07T20:32:43.2278017Z x0 = x[:, :D] 2025-05-07T20:32:43.2278091Z x1 = x[:, D:] 2025-05-07T20:32:43.2278158Z 2025-05-07T20:32:43.2278237Z if contiguous: 2025-05-07T20:32:43.2278326Z x0 = x0.contiguous() 2025-05-07T20:32:43.2278408Z x1 = x1.contiguous() 2025-05-07T20:32:43.2278477Z 2025-05-07T20:32:43.2278563Z if scale_ub is not None: 2025-05-07T20:32:43.2278664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2278791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2278862Z ) 2025-05-07T20:32:43.2278932Z else: 2025-05-07T20:32:43.2279019Z scale_ub_tensor = None 2025-05-07T20:32:43.2279094Z 2025-05-07T20:32:43.2279216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2279305Z op = silu_mul_quant 2025-05-07T20:32:43.2279386Z if compiled: 2025-05-07T20:32:43.2279479Z op = torch.compile(op) 2025-05-07T20:32:43.2279586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2279652Z 2025-05-07T20:32:43.2279735Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2279739Z 2025-05-07T20:32:43.2279831Z moe/activation_test.py:117: 2025-05-07T20:32:43.2279953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2280044Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2280137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2280494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2280583Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2281069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2281242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2281601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2281817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2282149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2282245Z kernel = self.compile( 2025-05-07T20:32:43.2282621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2282791Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2282910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2282914Z 2025-05-07T20:32:43.2283108Z self = 2025-05-07T20:32:43.2283880Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2284368Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53aefc0>} 2025-05-07T20:32:43.2285099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2285280Z context = 2025-05-07T20:32:43.2285285Z 2025-05-07T20:32:43.2285441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2285702Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2285806Z module_map=module_map) 2025-05-07T20:32:43.2286041Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2286135Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2286208Z E ^ 2025-05-07T20:32:43.2286561Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2286565Z 2025-05-07T20:32:43.2286971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2286975Z 2025-05-07T20:32:43.2287074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2287288Z self=, 2025-05-07T20:32:43.2287358Z T=4096, 2025-05-07T20:32:43.2287430Z D=5120, 2025-05-07T20:32:43.2287506Z scale_ub=1200.0, 2025-05-07T20:32:43.2287589Z contiguous=False, 2025-05-07T20:32:43.2287670Z compiled=False, 2025-05-07T20:32:43.2287735Z ) 2025-05-07T20:32:43.2287951Z self = 2025-05-07T20:32:43.2288124Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2288128Z 2025-05-07T20:32:43.2288199Z @given( 2025-05-07T20:32:43.2288313Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2288404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2288512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2288624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2288731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2288798Z ) 2025-05-07T20:32:43.2289036Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2289122Z def test_silu_mul_quant( 2025-05-07T20:32:43.2289194Z self, 2025-05-07T20:32:43.2289415Z T: int, 2025-05-07T20:32:43.2289483Z D: int, 2025-05-07T20:32:43.2289576Z scale_ub: Optional[float], 2025-05-07T20:32:43.2289668Z contiguous: bool, 2025-05-07T20:32:43.2289749Z compiled: bool, 2025-05-07T20:32:43.2289822Z ) -> None: 2025-05-07T20:32:43.2289909Z torch.manual_seed(2025) 2025-05-07T20:32:43.2289975Z 2025-05-07T20:32:43.2290139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2290208Z 2025-05-07T20:32:43.2290293Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2290415Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2290496Z x = x_sign * x_clamp 2025-05-07T20:32:43.2290571Z x0 = x[:, :D] 2025-05-07T20:32:43.2290649Z x1 = x[:, D:] 2025-05-07T20:32:43.2290714Z 2025-05-07T20:32:43.2290790Z if contiguous: 2025-05-07T20:32:43.2290879Z x0 = x0.contiguous() 2025-05-07T20:32:43.2290959Z x1 = x1.contiguous() 2025-05-07T20:32:43.2291033Z 2025-05-07T20:32:43.2291114Z if scale_ub is not None: 2025-05-07T20:32:43.2291220Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2291350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2291422Z ) 2025-05-07T20:32:43.2291493Z else: 2025-05-07T20:32:43.2291583Z scale_ub_tensor = None 2025-05-07T20:32:43.2291647Z 2025-05-07T20:32:43.2291770Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2291856Z op = silu_mul_quant 2025-05-07T20:32:43.2291936Z if compiled: 2025-05-07T20:32:43.2292029Z op = torch.compile(op) 2025-05-07T20:32:43.2292131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2292200Z 2025-05-07T20:32:43.2292290Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2292294Z 2025-05-07T20:32:43.2292385Z moe/activation_test.py:117: 2025-05-07T20:32:43.2292512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2292607Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2292777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2293266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2293360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2293714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2293933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2294261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2294347Z kernel = self.compile( 2025-05-07T20:32:43.2294745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2294917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2295041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2295048Z 2025-05-07T20:32:43.2295243Z self = 2025-05-07T20:32:43.2296006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2296493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca8360>} 2025-05-07T20:32:43.2297224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2297490Z context = 2025-05-07T20:32:43.2297495Z 2025-05-07T20:32:43.2297658Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2297910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2298016Z module_map=module_map) 2025-05-07T20:32:43.2298168Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2298258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2298335Z E ^ 2025-05-07T20:32:43.2298678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2298683Z 2025-05-07T20:32:43.2299107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2299117Z 2025-05-07T20:32:43.2299213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2299425Z self=, 2025-05-07T20:32:43.2299505Z T=4096, 2025-05-07T20:32:43.2299576Z D=5120, 2025-05-07T20:32:43.2299654Z scale_ub=1200.0, 2025-05-07T20:32:43.2299735Z contiguous=False, 2025-05-07T20:32:43.2299811Z compiled=True, 2025-05-07T20:32:43.2299878Z ) 2025-05-07T20:32:43.2300086Z self = 2025-05-07T20:32:43.2300253Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2300258Z 2025-05-07T20:32:43.2300334Z @given( 2025-05-07T20:32:43.2300445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2300539Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2300651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2300763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2300877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2304503Z ) 2025-05-07T20:32:43.2304861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2304955Z def test_silu_mul_quant( 2025-05-07T20:32:43.2305028Z self, 2025-05-07T20:32:43.2305100Z T: int, 2025-05-07T20:32:43.2305172Z D: int, 2025-05-07T20:32:43.2305264Z scale_ub: Optional[float], 2025-05-07T20:32:43.2305349Z contiguous: bool, 2025-05-07T20:32:43.2305430Z compiled: bool, 2025-05-07T20:32:43.2305503Z ) -> None: 2025-05-07T20:32:43.2305591Z torch.manual_seed(2025) 2025-05-07T20:32:43.2305662Z 2025-05-07T20:32:43.2305825Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2305890Z 2025-05-07T20:32:43.2305984Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2306105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2306203Z x = x_sign * x_clamp 2025-05-07T20:32:43.2306274Z x0 = x[:, :D] 2025-05-07T20:32:43.2306348Z x1 = x[:, D:] 2025-05-07T20:32:43.2306427Z 2025-05-07T20:32:43.2306503Z if contiguous: 2025-05-07T20:32:43.2306585Z x0 = x0.contiguous() 2025-05-07T20:32:43.2306671Z x1 = x1.contiguous() 2025-05-07T20:32:43.2306739Z 2025-05-07T20:32:43.2306826Z if scale_ub is not None: 2025-05-07T20:32:43.2306930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2307059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2307125Z ) 2025-05-07T20:32:43.2307200Z else: 2025-05-07T20:32:43.2307286Z scale_ub_tensor = None 2025-05-07T20:32:43.2307351Z 2025-05-07T20:32:43.2307560Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2307644Z op = silu_mul_quant 2025-05-07T20:32:43.2307727Z if compiled: 2025-05-07T20:32:43.2307931Z op = torch.compile(op) 2025-05-07T20:32:43.2308033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2308102Z 2025-05-07T20:32:43.2308192Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2308197Z 2025-05-07T20:32:43.2308290Z moe/activation_test.py:117: 2025-05-07T20:32:43.2308417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2308513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2308612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2308974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2309060Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2309549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2309641Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2310000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2310226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2310561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2310651Z kernel = self.compile( 2025-05-07T20:32:43.2311046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2311213Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2311335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2311340Z 2025-05-07T20:32:43.2311537Z self = 2025-05-07T20:32:43.2312299Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2312876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca94e0>} 2025-05-07T20:32:43.2313608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2313794Z context = 2025-05-07T20:32:43.2313798Z 2025-05-07T20:32:43.2313955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2314211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2314312Z module_map=module_map) 2025-05-07T20:32:43.2314471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2314569Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2314645Z E ^ 2025-05-07T20:32:43.2314992Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2315001Z 2025-05-07T20:32:43.2315431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2315435Z 2025-05-07T20:32:43.2315532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2315751Z self=, 2025-05-07T20:32:43.2315824Z T=2048, 2025-05-07T20:32:43.2315893Z D=7168, 2025-05-07T20:32:43.2315973Z scale_ub=1200.0, 2025-05-07T20:32:43.2316053Z contiguous=False, 2025-05-07T20:32:43.2316130Z compiled=False, 2025-05-07T20:32:43.2316197Z ) 2025-05-07T20:32:43.2316489Z self = 2025-05-07T20:32:43.2316661Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2316671Z 2025-05-07T20:32:43.2316741Z @given( 2025-05-07T20:32:43.2316855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2316950Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2317059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2317170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2317279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2317348Z ) 2025-05-07T20:32:43.2317585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2317673Z def test_silu_mul_quant( 2025-05-07T20:32:43.2317742Z self, 2025-05-07T20:32:43.2317814Z T: int, 2025-05-07T20:32:43.2317884Z D: int, 2025-05-07T20:32:43.2317976Z scale_ub: Optional[float], 2025-05-07T20:32:43.2318067Z contiguous: bool, 2025-05-07T20:32:43.2318148Z compiled: bool, 2025-05-07T20:32:43.2318218Z ) -> None: 2025-05-07T20:32:43.2318318Z torch.manual_seed(2025) 2025-05-07T20:32:43.2318388Z 2025-05-07T20:32:43.2318550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2318623Z 2025-05-07T20:32:43.2318708Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2318825Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2318915Z x = x_sign * x_clamp 2025-05-07T20:32:43.2318989Z x0 = x[:, :D] 2025-05-07T20:32:43.2319068Z x1 = x[:, D:] 2025-05-07T20:32:43.2319136Z 2025-05-07T20:32:43.2319215Z if contiguous: 2025-05-07T20:32:43.2319303Z x0 = x0.contiguous() 2025-05-07T20:32:43.2319386Z x1 = x1.contiguous() 2025-05-07T20:32:43.2319457Z 2025-05-07T20:32:43.2319546Z if scale_ub is not None: 2025-05-07T20:32:43.2319651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2319778Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2319936Z ) 2025-05-07T20:32:43.2320010Z else: 2025-05-07T20:32:43.2320097Z scale_ub_tensor = None 2025-05-07T20:32:43.2320168Z 2025-05-07T20:32:43.2320292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2320385Z op = silu_mul_quant 2025-05-07T20:32:43.2320466Z if compiled: 2025-05-07T20:32:43.2320559Z op = torch.compile(op) 2025-05-07T20:32:43.2320664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2320729Z 2025-05-07T20:32:43.2320816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2320820Z 2025-05-07T20:32:43.2320912Z moe/activation_test.py:117: 2025-05-07T20:32:43.2321040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2321136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2321238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2321733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2321827Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2322179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2322394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2322731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2322818Z kernel = self.compile( 2025-05-07T20:32:43.2323195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2323363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2323564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2323568Z 2025-05-07T20:32:43.2323771Z self = 2025-05-07T20:32:43.2324533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2325029Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca9f80>} 2025-05-07T20:32:43.2325759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2325942Z context = 2025-05-07T20:32:43.2325952Z 2025-05-07T20:32:43.2326113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2326372Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2326478Z module_map=module_map) 2025-05-07T20:32:43.2326632Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2326726Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2326800Z E ^ 2025-05-07T20:32:43.2327144Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2327149Z 2025-05-07T20:32:43.2327558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2327565Z 2025-05-07T20:32:43.2327661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2327877Z self=, 2025-05-07T20:32:43.2327956Z T=1, 2025-05-07T20:32:43.2328027Z D=7168, 2025-05-07T20:32:43.2328101Z scale_ub=None, 2025-05-07T20:32:43.2328262Z contiguous=True, 2025-05-07T20:32:43.2328343Z compiled=False, 2025-05-07T20:32:43.2328412Z ) 2025-05-07T20:32:43.2328629Z self = 2025-05-07T20:32:43.2328786Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2328791Z 2025-05-07T20:32:43.2328865Z @given( 2025-05-07T20:32:43.2328981Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2329076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2329191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2329301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2329409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2329480Z ) 2025-05-07T20:32:43.2329722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2329809Z def test_silu_mul_quant( 2025-05-07T20:32:43.2329890Z self, 2025-05-07T20:32:43.2329963Z T: int, 2025-05-07T20:32:43.2330036Z D: int, 2025-05-07T20:32:43.2330132Z scale_ub: Optional[float], 2025-05-07T20:32:43.2330216Z contiguous: bool, 2025-05-07T20:32:43.2330298Z compiled: bool, 2025-05-07T20:32:43.2330371Z ) -> None: 2025-05-07T20:32:43.2330457Z torch.manual_seed(2025) 2025-05-07T20:32:43.2330527Z 2025-05-07T20:32:43.2330689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2330756Z 2025-05-07T20:32:43.2330845Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2330962Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2331043Z x = x_sign * x_clamp 2025-05-07T20:32:43.2331119Z x0 = x[:, :D] 2025-05-07T20:32:43.2331279Z x1 = x[:, D:] 2025-05-07T20:32:43.2331346Z 2025-05-07T20:32:43.2331428Z if contiguous: 2025-05-07T20:32:43.2331512Z x0 = x0.contiguous() 2025-05-07T20:32:43.2331600Z x1 = x1.contiguous() 2025-05-07T20:32:43.2331672Z 2025-05-07T20:32:43.2331755Z if scale_ub is not None: 2025-05-07T20:32:43.2331857Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2331983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2332054Z ) 2025-05-07T20:32:43.2332129Z else: 2025-05-07T20:32:43.2332217Z scale_ub_tensor = None 2025-05-07T20:32:43.2332281Z 2025-05-07T20:32:43.2332409Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2332494Z op = silu_mul_quant 2025-05-07T20:32:43.2332573Z if compiled: 2025-05-07T20:32:43.2332669Z op = torch.compile(op) 2025-05-07T20:32:43.2332774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2332848Z 2025-05-07T20:32:43.2332936Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2332940Z 2025-05-07T20:32:43.2333032Z moe/activation_test.py:117: 2025-05-07T20:32:43.2333170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2333264Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2333358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2333849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2333942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2334293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2334515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2334848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2334948Z kernel = self.compile( 2025-05-07T20:32:43.2335424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2335594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2335721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2335725Z 2025-05-07T20:32:43.2335920Z self = 2025-05-07T20:32:43.2336682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2337171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4cab2e0>} 2025-05-07T20:32:43.2337913Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2338102Z context = 2025-05-07T20:32:43.2338107Z 2025-05-07T20:32:43.2338266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2338528Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2338635Z module_map=module_map) 2025-05-07T20:32:43.2338792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2338890Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2338966Z E ^ 2025-05-07T20:32:43.2339317Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2340045Z 2025-05-07T20:32:43.2342149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2342167Z 2025-05-07T20:32:43.2342277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2342504Z self=, 2025-05-07T20:32:43.2342583Z T=16384, 2025-05-07T20:32:43.2342656Z D=7168, 2025-05-07T20:32:43.2342741Z scale_ub=1200.0, 2025-05-07T20:32:43.2342822Z contiguous=False, 2025-05-07T20:32:43.2342904Z compiled=True, 2025-05-07T20:32:43.2342970Z ) 2025-05-07T20:32:43.2343185Z self = 2025-05-07T20:32:43.2343366Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2343371Z 2025-05-07T20:32:43.2343443Z @given( 2025-05-07T20:32:43.2343558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2343661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2343771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2343884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2343998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2344067Z ) 2025-05-07T20:32:43.2344310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2344398Z def test_silu_mul_quant( 2025-05-07T20:32:43.2344471Z self, 2025-05-07T20:32:43.2344549Z T: int, 2025-05-07T20:32:43.2344621Z D: int, 2025-05-07T20:32:43.2344712Z scale_ub: Optional[float], 2025-05-07T20:32:43.2344798Z contiguous: bool, 2025-05-07T20:32:43.2344876Z compiled: bool, 2025-05-07T20:32:43.2344951Z ) -> None: 2025-05-07T20:32:43.2345040Z torch.manual_seed(2025) 2025-05-07T20:32:43.2345108Z 2025-05-07T20:32:43.2345272Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2345354Z 2025-05-07T20:32:43.2345442Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2345833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2345923Z x = x_sign * x_clamp 2025-05-07T20:32:43.2345996Z x0 = x[:, :D] 2025-05-07T20:32:43.2346071Z x1 = x[:, D:] 2025-05-07T20:32:43.2346135Z 2025-05-07T20:32:43.2346212Z if contiguous: 2025-05-07T20:32:43.2346301Z x0 = x0.contiguous() 2025-05-07T20:32:43.2346386Z x1 = x1.contiguous() 2025-05-07T20:32:43.2346453Z 2025-05-07T20:32:43.2346542Z if scale_ub is not None: 2025-05-07T20:32:43.2346642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2346771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2346849Z ) 2025-05-07T20:32:43.2346923Z else: 2025-05-07T20:32:43.2347018Z scale_ub_tensor = None 2025-05-07T20:32:43.2347090Z 2025-05-07T20:32:43.2347215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2347303Z op = silu_mul_quant 2025-05-07T20:32:43.2347390Z if compiled: 2025-05-07T20:32:43.2347583Z op = torch.compile(op) 2025-05-07T20:32:43.2347689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2347754Z 2025-05-07T20:32:43.2347840Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2347844Z 2025-05-07T20:32:43.2347939Z moe/activation_test.py:117: 2025-05-07T20:32:43.2348066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2348163Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2348261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2348625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2348716Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2349212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2349440Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2349803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2350021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2350360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2350448Z kernel = self.compile( 2025-05-07T20:32:43.2350829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2351000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2351121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2351126Z 2025-05-07T20:32:43.2351328Z self = 2025-05-07T20:32:43.2352107Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2352602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bb85e0>} 2025-05-07T20:32:43.2353495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2353684Z context = 2025-05-07T20:32:43.2353688Z 2025-05-07T20:32:43.2353851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2354114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2354312Z module_map=module_map) 2025-05-07T20:32:43.2354477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2354571Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2354643Z E ^ 2025-05-07T20:32:43.2354993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2354998Z 2025-05-07T20:32:43.2355409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2355414Z 2025-05-07T20:32:43.2355514Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2355729Z self=, 2025-05-07T20:32:43.2355799Z T=1, 2025-05-07T20:32:43.2355874Z D=7168, 2025-05-07T20:32:43.2355953Z scale_ub=None, 2025-05-07T20:32:43.2356036Z contiguous=False, 2025-05-07T20:32:43.2356120Z compiled=False, 2025-05-07T20:32:43.2356186Z ) 2025-05-07T20:32:43.2356405Z self = 2025-05-07T20:32:43.2356567Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2356572Z 2025-05-07T20:32:43.2356643Z @given( 2025-05-07T20:32:43.2356760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2356852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2356962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2357077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2357182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2357253Z ) 2025-05-07T20:32:43.2357495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2357664Z def test_silu_mul_quant( 2025-05-07T20:32:43.2357742Z self, 2025-05-07T20:32:43.2357820Z T: int, 2025-05-07T20:32:43.2357890Z D: int, 2025-05-07T20:32:43.2357989Z scale_ub: Optional[float], 2025-05-07T20:32:43.2358076Z contiguous: bool, 2025-05-07T20:32:43.2358153Z compiled: bool, 2025-05-07T20:32:43.2358227Z ) -> None: 2025-05-07T20:32:43.2358315Z torch.manual_seed(2025) 2025-05-07T20:32:43.2358381Z 2025-05-07T20:32:43.2358546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2358615Z 2025-05-07T20:32:43.2358698Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2358819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2358901Z x = x_sign * x_clamp 2025-05-07T20:32:43.2358981Z x0 = x[:, :D] 2025-05-07T20:32:43.2359055Z x1 = x[:, D:] 2025-05-07T20:32:43.2359122Z 2025-05-07T20:32:43.2359200Z if contiguous: 2025-05-07T20:32:43.2359292Z x0 = x0.contiguous() 2025-05-07T20:32:43.2359374Z x1 = x1.contiguous() 2025-05-07T20:32:43.2359446Z 2025-05-07T20:32:43.2359537Z if scale_ub is not None: 2025-05-07T20:32:43.2359637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2359769Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2359840Z ) 2025-05-07T20:32:43.2359913Z else: 2025-05-07T20:32:43.2360008Z scale_ub_tensor = None 2025-05-07T20:32:43.2360075Z 2025-05-07T20:32:43.2360199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2360293Z op = silu_mul_quant 2025-05-07T20:32:43.2360371Z if compiled: 2025-05-07T20:32:43.2360469Z op = torch.compile(op) 2025-05-07T20:32:43.2360569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2360638Z 2025-05-07T20:32:43.2360726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2360730Z 2025-05-07T20:32:43.2360825Z moe/activation_test.py:117: 2025-05-07T20:32:43.2360953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2361131Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2361227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2361717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2361805Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2362159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2362378Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2362714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2362803Z kernel = self.compile( 2025-05-07T20:32:43.2363186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2363364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2363494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2363499Z 2025-05-07T20:32:43.2363760Z self = 2025-05-07T20:32:43.2364568Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2365063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bb8fe0>} 2025-05-07T20:32:43.2365798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2366076Z context = 2025-05-07T20:32:43.2366081Z 2025-05-07T20:32:43.2366237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2366494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2366594Z module_map=module_map) 2025-05-07T20:32:43.2366748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2366843Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2366915Z E ^ 2025-05-07T20:32:43.2367288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2367293Z 2025-05-07T20:32:43.2367749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2367759Z 2025-05-07T20:32:43.2367853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2368073Z self=, 2025-05-07T20:32:43.2368142Z T=2048, 2025-05-07T20:32:43.2368215Z D=7168, 2025-05-07T20:32:43.2368295Z scale_ub=None, 2025-05-07T20:32:43.2368378Z contiguous=False, 2025-05-07T20:32:43.2368452Z compiled=True, 2025-05-07T20:32:43.2368524Z ) 2025-05-07T20:32:43.2368734Z self = 2025-05-07T20:32:43.2368903Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2368908Z 2025-05-07T20:32:43.2368982Z @given( 2025-05-07T20:32:43.2369092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2369186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2369293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2369409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2369524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2369675Z ) 2025-05-07T20:32:43.2369914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2370003Z def test_silu_mul_quant( 2025-05-07T20:32:43.2370074Z self, 2025-05-07T20:32:43.2370148Z T: int, 2025-05-07T20:32:43.2370223Z D: int, 2025-05-07T20:32:43.2370313Z scale_ub: Optional[float], 2025-05-07T20:32:43.2370399Z contiguous: bool, 2025-05-07T20:32:43.2370476Z compiled: bool, 2025-05-07T20:32:43.2370544Z ) -> None: 2025-05-07T20:32:43.2370633Z torch.manual_seed(2025) 2025-05-07T20:32:43.2370698Z 2025-05-07T20:32:43.2370857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2370928Z 2025-05-07T20:32:43.2371014Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2371134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2371218Z x = x_sign * x_clamp 2025-05-07T20:32:43.2371291Z x0 = x[:, :D] 2025-05-07T20:32:43.2371369Z x1 = x[:, D:] 2025-05-07T20:32:43.2371438Z 2025-05-07T20:32:43.2371514Z if contiguous: 2025-05-07T20:32:43.2371604Z x0 = x0.contiguous() 2025-05-07T20:32:43.2371687Z x1 = x1.contiguous() 2025-05-07T20:32:43.2371755Z 2025-05-07T20:32:43.2371839Z if scale_ub is not None: 2025-05-07T20:32:43.2371940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2372068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2372141Z ) 2025-05-07T20:32:43.2372212Z else: 2025-05-07T20:32:43.2372298Z scale_ub_tensor = None 2025-05-07T20:32:43.2372372Z 2025-05-07T20:32:43.2372494Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2372577Z op = silu_mul_quant 2025-05-07T20:32:43.2372772Z if compiled: 2025-05-07T20:32:43.2372864Z op = torch.compile(op) 2025-05-07T20:32:43.2372968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2373038Z 2025-05-07T20:32:43.2373119Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2373124Z 2025-05-07T20:32:43.2373218Z moe/activation_test.py:117: 2025-05-07T20:32:43.2373341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2373436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2373533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2373894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2373979Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2374553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2374699Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2375236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2375503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2375839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2375930Z kernel = self.compile( 2025-05-07T20:32:43.2376308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2376474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2376596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2376601Z 2025-05-07T20:32:43.2376795Z self = 2025-05-07T20:32:43.2377708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2378206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bba7a0>} 2025-05-07T20:32:43.2378944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2379128Z context = 2025-05-07T20:32:43.2379132Z 2025-05-07T20:32:43.2379288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2379544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2379651Z module_map=module_map) 2025-05-07T20:32:43.2379808Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2379903Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2379976Z E ^ 2025-05-07T20:32:43.2380328Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2380333Z 2025-05-07T20:32:43.2380741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2380746Z 2025-05-07T20:32:43.2380853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2381070Z self=, 2025-05-07T20:32:43.2381142Z T=4096, 2025-05-07T20:32:43.2381215Z D=7168, 2025-05-07T20:32:43.2381291Z scale_ub=None, 2025-05-07T20:32:43.2381375Z contiguous=False, 2025-05-07T20:32:43.2381534Z compiled=True, 2025-05-07T20:32:43.2381602Z ) 2025-05-07T20:32:43.2381813Z self = 2025-05-07T20:32:43.2381987Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2381992Z 2025-05-07T20:32:43.2382062Z @given( 2025-05-07T20:32:43.2382172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2382266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2382373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2382488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2382593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2382661Z ) 2025-05-07T20:32:43.2382902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2382994Z def test_silu_mul_quant( 2025-05-07T20:32:43.2383066Z self, 2025-05-07T20:32:43.2383137Z T: int, 2025-05-07T20:32:43.2383212Z D: int, 2025-05-07T20:32:43.2383301Z scale_ub: Optional[float], 2025-05-07T20:32:43.2383395Z contiguous: bool, 2025-05-07T20:32:43.2383478Z compiled: bool, 2025-05-07T20:32:43.2383556Z ) -> None: 2025-05-07T20:32:43.2383645Z torch.manual_seed(2025) 2025-05-07T20:32:43.2383712Z 2025-05-07T20:32:43.2383881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2383953Z 2025-05-07T20:32:43.2384041Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2384163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2384244Z x = x_sign * x_clamp 2025-05-07T20:32:43.2384316Z x0 = x[:, :D] 2025-05-07T20:32:43.2384398Z x1 = x[:, D:] 2025-05-07T20:32:43.2384465Z 2025-05-07T20:32:43.2384541Z if contiguous: 2025-05-07T20:32:43.2384631Z x0 = x0.contiguous() 2025-05-07T20:32:43.2384713Z x1 = x1.contiguous() 2025-05-07T20:32:43.2384785Z 2025-05-07T20:32:43.2384872Z if scale_ub is not None: 2025-05-07T20:32:43.2384972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2385252Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2385359Z ) 2025-05-07T20:32:43.2385433Z else: 2025-05-07T20:32:43.2385526Z scale_ub_tensor = None 2025-05-07T20:32:43.2385593Z 2025-05-07T20:32:43.2385718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2385806Z op = silu_mul_quant 2025-05-07T20:32:43.2385883Z if compiled: 2025-05-07T20:32:43.2385972Z op = torch.compile(op) 2025-05-07T20:32:43.2386074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2386139Z 2025-05-07T20:32:43.2386223Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2386230Z 2025-05-07T20:32:43.2386321Z moe/activation_test.py:117: 2025-05-07T20:32:43.2386446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2386549Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2386643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2387012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2387104Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2387670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2387759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2388113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2388328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2388672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2388758Z kernel = self.compile( 2025-05-07T20:32:43.2389226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2389400Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2389519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2389524Z 2025-05-07T20:32:43.2389724Z self = 2025-05-07T20:32:43.2390489Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2390983Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bbb4c0>} 2025-05-07T20:32:43.2391724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2391914Z context = 2025-05-07T20:32:43.2391918Z 2025-05-07T20:32:43.2392078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2392330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2392430Z module_map=module_map) 2025-05-07T20:32:43.2392588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2392679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2392755Z E ^ 2025-05-07T20:32:43.2393101Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2393106Z 2025-05-07T20:32:43.2393519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2393524Z 2025-05-07T20:32:43.2393705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2393924Z self=, 2025-05-07T20:32:43.2394004Z T=16384, 2025-05-07T20:32:43.2394078Z D=5120, 2025-05-07T20:32:43.2394151Z scale_ub=1200.0, 2025-05-07T20:32:43.2394236Z contiguous=False, 2025-05-07T20:32:43.2394315Z compiled=False, 2025-05-07T20:32:43.2394383Z ) 2025-05-07T20:32:43.2394603Z self = 2025-05-07T20:32:43.2394778Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2394783Z 2025-05-07T20:32:43.2394855Z @given( 2025-05-07T20:32:43.2394969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2395062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2395174Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2395287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2395400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2395476Z ) 2025-05-07T20:32:43.2395714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2395830Z def test_silu_mul_quant( 2025-05-07T20:32:43.2395936Z self, 2025-05-07T20:32:43.2396040Z T: int, 2025-05-07T20:32:43.2396127Z D: int, 2025-05-07T20:32:43.2396225Z scale_ub: Optional[float], 2025-05-07T20:32:43.2396307Z contiguous: bool, 2025-05-07T20:32:43.2396384Z compiled: bool, 2025-05-07T20:32:43.2396459Z ) -> None: 2025-05-07T20:32:43.2396546Z torch.manual_seed(2025) 2025-05-07T20:32:43.2396613Z 2025-05-07T20:32:43.2396779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2396942Z 2025-05-07T20:32:43.2397035Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2397156Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2397262Z x = x_sign * x_clamp 2025-05-07T20:32:43.2397342Z x0 = x[:, :D] 2025-05-07T20:32:43.2397431Z x1 = x[:, D:] 2025-05-07T20:32:43.2397507Z 2025-05-07T20:32:43.2397584Z if contiguous: 2025-05-07T20:32:43.2397668Z x0 = x0.contiguous() 2025-05-07T20:32:43.2397749Z x1 = x1.contiguous() 2025-05-07T20:32:43.2397823Z 2025-05-07T20:32:43.2397907Z if scale_ub is not None: 2025-05-07T20:32:43.2398007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2398137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2398207Z ) 2025-05-07T20:32:43.2398280Z else: 2025-05-07T20:32:43.2398370Z scale_ub_tensor = None 2025-05-07T20:32:43.2398442Z 2025-05-07T20:32:43.2398573Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2398661Z op = silu_mul_quant 2025-05-07T20:32:43.2398739Z if compiled: 2025-05-07T20:32:43.2398842Z op = torch.compile(op) 2025-05-07T20:32:43.2398944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2399013Z 2025-05-07T20:32:43.2399101Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2399105Z 2025-05-07T20:32:43.2399197Z moe/activation_test.py:117: 2025-05-07T20:32:43.2399325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2399418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2399519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2400011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2400102Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2400457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2400682Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2401128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2401219Z kernel = self.compile( 2025-05-07T20:32:43.2401597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2401765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2401887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2401891Z 2025-05-07T20:32:43.2402086Z self = 2025-05-07T20:32:43.2402853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2403355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1c860>} 2025-05-07T20:32:43.2404093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2404278Z context = 2025-05-07T20:32:43.2404282Z 2025-05-07T20:32:43.2404439Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2404697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2404800Z module_map=module_map) 2025-05-07T20:32:43.2404953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2405128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2405199Z E ^ 2025-05-07T20:32:43.2405551Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2405561Z 2025-05-07T20:32:43.2405971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2405976Z 2025-05-07T20:32:43.2406071Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2406292Z self=, 2025-05-07T20:32:43.2406364Z T=16384, 2025-05-07T20:32:43.2406438Z D=5120, 2025-05-07T20:32:43.2406539Z scale_ub=1200.0, 2025-05-07T20:32:43.2406655Z contiguous=True, 2025-05-07T20:32:43.2406769Z compiled=True, 2025-05-07T20:32:43.2406860Z ) 2025-05-07T20:32:43.2407075Z self = 2025-05-07T20:32:43.2407256Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2407261Z 2025-05-07T20:32:43.2407339Z @given( 2025-05-07T20:32:43.2407451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2407545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2407653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2407767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2407887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2407957Z ) 2025-05-07T20:32:43.2408192Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2408286Z def test_silu_mul_quant( 2025-05-07T20:32:43.2408358Z self, 2025-05-07T20:32:43.2408435Z T: int, 2025-05-07T20:32:43.2408507Z D: int, 2025-05-07T20:32:43.2408596Z scale_ub: Optional[float], 2025-05-07T20:32:43.2408684Z contiguous: bool, 2025-05-07T20:32:43.2408762Z compiled: bool, 2025-05-07T20:32:43.2408835Z ) -> None: 2025-05-07T20:32:43.2408928Z torch.manual_seed(2025) 2025-05-07T20:32:43.2409085Z 2025-05-07T20:32:43.2409248Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2409320Z 2025-05-07T20:32:43.2409405Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2409523Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2409611Z x = x_sign * x_clamp 2025-05-07T20:32:43.2409684Z x0 = x[:, :D] 2025-05-07T20:32:43.2409760Z x1 = x[:, D:] 2025-05-07T20:32:43.2409828Z 2025-05-07T20:32:43.2409905Z if contiguous: 2025-05-07T20:32:43.2409991Z x0 = x0.contiguous() 2025-05-07T20:32:43.2410076Z x1 = x1.contiguous() 2025-05-07T20:32:43.2410145Z 2025-05-07T20:32:43.2410232Z if scale_ub is not None: 2025-05-07T20:32:43.2410332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2410462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2410537Z ) 2025-05-07T20:32:43.2410614Z else: 2025-05-07T20:32:43.2410702Z scale_ub_tensor = None 2025-05-07T20:32:43.2410785Z 2025-05-07T20:32:43.2410908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2410993Z op = silu_mul_quant 2025-05-07T20:32:43.2411075Z if compiled: 2025-05-07T20:32:43.2411169Z op = torch.compile(op) 2025-05-07T20:32:43.2411274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2411342Z 2025-05-07T20:32:43.2411426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2411430Z 2025-05-07T20:32:43.2411524Z moe/activation_test.py:117: 2025-05-07T20:32:43.2411646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2411739Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2411833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2412284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2412377Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2412864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2412953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2413307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2413521Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2413857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2413946Z kernel = self.compile( 2025-05-07T20:32:43.2414325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2414504Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2414628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2414633Z 2025-05-07T20:32:43.2414830Z self = 2025-05-07T20:32:43.2415597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2416089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1db20>} 2025-05-07T20:32:43.2416826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2417012Z context = 2025-05-07T20:32:43.2417096Z 2025-05-07T20:32:43.2417276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2417648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2417792Z module_map=module_map) 2025-05-07T20:32:43.2418015Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2418146Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2418248Z E ^ 2025-05-07T20:32:43.2418653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2418658Z 2025-05-07T20:32:43.2419067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2419078Z 2025-05-07T20:32:43.2419177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2419398Z self=, 2025-05-07T20:32:43.2419472Z T=16384, 2025-05-07T20:32:43.2419550Z D=5120, 2025-05-07T20:32:43.2419627Z scale_ub=None, 2025-05-07T20:32:43.2419711Z contiguous=False, 2025-05-07T20:32:43.2419793Z compiled=True, 2025-05-07T20:32:43.2419864Z ) 2025-05-07T20:32:43.2420076Z self = 2025-05-07T20:32:43.2420251Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2420256Z 2025-05-07T20:32:43.2420328Z @given( 2025-05-07T20:32:43.2420451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2420549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2420658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2420776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2420986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2421058Z ) 2025-05-07T20:32:43.2421304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2421393Z def test_silu_mul_quant( 2025-05-07T20:32:43.2421465Z self, 2025-05-07T20:32:43.2421540Z T: int, 2025-05-07T20:32:43.2421612Z D: int, 2025-05-07T20:32:43.2421709Z scale_ub: Optional[float], 2025-05-07T20:32:43.2421793Z contiguous: bool, 2025-05-07T20:32:43.2421874Z compiled: bool, 2025-05-07T20:32:43.2421950Z ) -> None: 2025-05-07T20:32:43.2422041Z torch.manual_seed(2025) 2025-05-07T20:32:43.2422110Z 2025-05-07T20:32:43.2422274Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2422344Z 2025-05-07T20:32:43.2422430Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2426500Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2426617Z x = x_sign * x_clamp 2025-05-07T20:32:43.2426699Z x0 = x[:, :D] 2025-05-07T20:32:43.2426779Z x1 = x[:, D:] 2025-05-07T20:32:43.2426851Z 2025-05-07T20:32:43.2426933Z if contiguous: 2025-05-07T20:32:43.2427026Z x0 = x0.contiguous() 2025-05-07T20:32:43.2427111Z x1 = x1.contiguous() 2025-05-07T20:32:43.2427183Z 2025-05-07T20:32:43.2427266Z if scale_ub is not None: 2025-05-07T20:32:43.2427368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2427581Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2427655Z ) 2025-05-07T20:32:43.2427729Z else: 2025-05-07T20:32:43.2427826Z scale_ub_tensor = None 2025-05-07T20:32:43.2427896Z 2025-05-07T20:32:43.2428061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2428192Z op = silu_mul_quant 2025-05-07T20:32:43.2428308Z if compiled: 2025-05-07T20:32:43.2428431Z op = torch.compile(op) 2025-05-07T20:32:43.2428540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2428606Z 2025-05-07T20:32:43.2428811Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2428817Z 2025-05-07T20:32:43.2428912Z moe/activation_test.py:117: 2025-05-07T20:32:43.2429040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2429141Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2429235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2429613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2429703Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2430189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2430283Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2430640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2430863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2431202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2431292Z kernel = self.compile( 2025-05-07T20:32:43.2431675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2431850Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2431972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2431977Z 2025-05-07T20:32:43.2432177Z self = 2025-05-07T20:32:43.2432944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2433568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1e8e0>} 2025-05-07T20:32:43.2434303Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2434489Z context = 2025-05-07T20:32:43.2434493Z 2025-05-07T20:32:43.2434655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2434913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2435022Z module_map=module_map) 2025-05-07T20:32:43.2435186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2435281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2435360Z E ^ 2025-05-07T20:32:43.2435710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2435716Z 2025-05-07T20:32:43.2436120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2436129Z 2025-05-07T20:32:43.2436225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2436442Z self=, 2025-05-07T20:32:43.2436517Z T=2048, 2025-05-07T20:32:43.2436590Z D=5120, 2025-05-07T20:32:43.2436679Z scale_ub=None, 2025-05-07T20:32:43.2436766Z contiguous=False, 2025-05-07T20:32:43.2436844Z compiled=True, 2025-05-07T20:32:43.2436910Z ) 2025-05-07T20:32:43.2437133Z self = 2025-05-07T20:32:43.2437407Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2437412Z 2025-05-07T20:32:43.2437490Z @given( 2025-05-07T20:32:43.2437611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2437708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2437834Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2437947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2438055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2438132Z ) 2025-05-07T20:32:43.2438374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2438461Z def test_silu_mul_quant( 2025-05-07T20:32:43.2438537Z self, 2025-05-07T20:32:43.2438610Z T: int, 2025-05-07T20:32:43.2438684Z D: int, 2025-05-07T20:32:43.2438807Z scale_ub: Optional[float], 2025-05-07T20:32:43.2438933Z contiguous: bool, 2025-05-07T20:32:43.2439055Z compiled: bool, 2025-05-07T20:32:43.2439143Z ) -> None: 2025-05-07T20:32:43.2439234Z torch.manual_seed(2025) 2025-05-07T20:32:43.2439306Z 2025-05-07T20:32:43.2439471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2439545Z 2025-05-07T20:32:43.2439638Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2439758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2439840Z x = x_sign * x_clamp 2025-05-07T20:32:43.2439926Z x0 = x[:, :D] 2025-05-07T20:32:43.2440001Z x1 = x[:, D:] 2025-05-07T20:32:43.2440489Z 2025-05-07T20:32:43.2440623Z if contiguous: 2025-05-07T20:32:43.2440751Z x0 = x0.contiguous() 2025-05-07T20:32:43.2440871Z x1 = x1.contiguous() 2025-05-07T20:32:43.2440976Z 2025-05-07T20:32:43.2441112Z if scale_ub is not None: 2025-05-07T20:32:43.2441411Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2441545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2441623Z ) 2025-05-07T20:32:43.2441701Z else: 2025-05-07T20:32:43.2441792Z scale_ub_tensor = None 2025-05-07T20:32:43.2441862Z 2025-05-07T20:32:43.2441993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2442080Z op = silu_mul_quant 2025-05-07T20:32:43.2442164Z if compiled: 2025-05-07T20:32:43.2442266Z op = torch.compile(op) 2025-05-07T20:32:43.2442369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2442437Z 2025-05-07T20:32:43.2442525Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2442530Z 2025-05-07T20:32:43.2442625Z moe/activation_test.py:117: 2025-05-07T20:32:43.2442760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2442860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2442955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2443329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2443419Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2443909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2444007Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2444364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2444586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2444922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2445010Z kernel = self.compile( 2025-05-07T20:32:43.2445414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2445708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2445836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2445842Z 2025-05-07T20:32:43.2446036Z self = 2025-05-07T20:32:43.2446810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2447312Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4040>} 2025-05-07T20:32:43.2448052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2448254Z context = 2025-05-07T20:32:43.2448259Z 2025-05-07T20:32:43.2448417Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2448673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2448781Z module_map=module_map) 2025-05-07T20:32:43.2448938Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2449036Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2449109Z E ^ 2025-05-07T20:32:43.2449456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2449461Z 2025-05-07T20:32:43.2450032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2450134Z 2025-05-07T20:32:43.2450237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2450497Z self=, 2025-05-07T20:32:43.2450601Z T=2048, 2025-05-07T20:32:43.2450685Z D=5120, 2025-05-07T20:32:43.2450767Z scale_ub=1200.0, 2025-05-07T20:32:43.2450848Z contiguous=False, 2025-05-07T20:32:43.2450933Z compiled=True, 2025-05-07T20:32:43.2451001Z ) 2025-05-07T20:32:43.2451215Z self = 2025-05-07T20:32:43.2451384Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2451389Z 2025-05-07T20:32:43.2451466Z @given( 2025-05-07T20:32:43.2451583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2451680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2451791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2451908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2452020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2452094Z ) 2025-05-07T20:32:43.2452332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2452424Z def test_silu_mul_quant( 2025-05-07T20:32:43.2452496Z self, 2025-05-07T20:32:43.2452570Z T: int, 2025-05-07T20:32:43.2452645Z D: int, 2025-05-07T20:32:43.2452737Z scale_ub: Optional[float], 2025-05-07T20:32:43.2452823Z contiguous: bool, 2025-05-07T20:32:43.2452907Z compiled: bool, 2025-05-07T20:32:43.2452981Z ) -> None: 2025-05-07T20:32:43.2453075Z torch.manual_seed(2025) 2025-05-07T20:32:43.2453143Z 2025-05-07T20:32:43.2453306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2453376Z 2025-05-07T20:32:43.2453464Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2453587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2453676Z x = x_sign * x_clamp 2025-05-07T20:32:43.2453836Z x0 = x[:, :D] 2025-05-07T20:32:43.2453914Z x1 = x[:, D:] 2025-05-07T20:32:43.2453986Z 2025-05-07T20:32:43.2454063Z if contiguous: 2025-05-07T20:32:43.2454147Z x0 = x0.contiguous() 2025-05-07T20:32:43.2454233Z x1 = x1.contiguous() 2025-05-07T20:32:43.2454302Z 2025-05-07T20:32:43.2454386Z if scale_ub is not None: 2025-05-07T20:32:43.2454490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2454617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2454691Z ) 2025-05-07T20:32:43.2454763Z else: 2025-05-07T20:32:43.2454852Z scale_ub_tensor = None 2025-05-07T20:32:43.2454921Z 2025-05-07T20:32:43.2455047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2455139Z op = silu_mul_quant 2025-05-07T20:32:43.2455224Z if compiled: 2025-05-07T20:32:43.2455317Z op = torch.compile(op) 2025-05-07T20:32:43.2455422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2455496Z 2025-05-07T20:32:43.2455582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2455587Z 2025-05-07T20:32:43.2455680Z moe/activation_test.py:117: 2025-05-07T20:32:43.2455803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2455897Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2455993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2456355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2456443Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2456932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2457108Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2457525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2457743Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2458079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2458173Z kernel = self.compile( 2025-05-07T20:32:43.2458573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2458747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2458875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2458879Z 2025-05-07T20:32:43.2459075Z self = 2025-05-07T20:32:43.2459848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2460344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4e00>} 2025-05-07T20:32:43.2461082Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2461268Z context = 2025-05-07T20:32:43.2461273Z 2025-05-07T20:32:43.2461434Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2461693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2461800Z module_map=module_map) 2025-05-07T20:32:43.2461956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2462137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2462211Z E ^ 2025-05-07T20:32:43.2462562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2462568Z 2025-05-07T20:32:43.2462979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2462984Z 2025-05-07T20:32:43.2463082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2463299Z self=, 2025-05-07T20:32:43.2463374Z T=4096, 2025-05-07T20:32:43.2463450Z D=5120, 2025-05-07T20:32:43.2463530Z scale_ub=1200.0, 2025-05-07T20:32:43.2463609Z contiguous=True, 2025-05-07T20:32:43.2463693Z compiled=True, 2025-05-07T20:32:43.2463762Z ) 2025-05-07T20:32:43.2463975Z self = 2025-05-07T20:32:43.2464150Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2464155Z 2025-05-07T20:32:43.2464226Z @given( 2025-05-07T20:32:43.2464340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2464439Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2464549Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2464668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2464774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2464847Z ) 2025-05-07T20:32:43.2465088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2465177Z def test_silu_mul_quant( 2025-05-07T20:32:43.2465248Z self, 2025-05-07T20:32:43.2465323Z T: int, 2025-05-07T20:32:43.2465498Z D: int, 2025-05-07T20:32:43.2465591Z scale_ub: Optional[float], 2025-05-07T20:32:43.2465680Z contiguous: bool, 2025-05-07T20:32:43.2465763Z compiled: bool, 2025-05-07T20:32:43.2465834Z ) -> None: 2025-05-07T20:32:43.2465928Z torch.manual_seed(2025) 2025-05-07T20:32:43.2465996Z 2025-05-07T20:32:43.2466162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2466232Z 2025-05-07T20:32:43.2466317Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2466442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2466527Z x = x_sign * x_clamp 2025-05-07T20:32:43.2466601Z x0 = x[:, :D] 2025-05-07T20:32:43.2466679Z x1 = x[:, D:] 2025-05-07T20:32:43.2466745Z 2025-05-07T20:32:43.2466822Z if contiguous: 2025-05-07T20:32:43.2466911Z x0 = x0.contiguous() 2025-05-07T20:32:43.2466995Z x1 = x1.contiguous() 2025-05-07T20:32:43.2467073Z 2025-05-07T20:32:43.2467167Z if scale_ub is not None: 2025-05-07T20:32:43.2467291Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2467524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2467611Z ) 2025-05-07T20:32:43.2467685Z else: 2025-05-07T20:32:43.2467777Z scale_ub_tensor = None 2025-05-07T20:32:43.2467845Z 2025-05-07T20:32:43.2467969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2468056Z op = silu_mul_quant 2025-05-07T20:32:43.2468137Z if compiled: 2025-05-07T20:32:43.2468233Z op = torch.compile(op) 2025-05-07T20:32:43.2468338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2468407Z 2025-05-07T20:32:43.2468491Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2468495Z 2025-05-07T20:32:43.2468592Z moe/activation_test.py:117: 2025-05-07T20:32:43.2468719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2468822Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2468915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2469408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2469504Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2469986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2470078Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2470435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2470652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2470985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2471077Z kernel = self.compile( 2025-05-07T20:32:43.2471460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2471634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2471754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2471759Z 2025-05-07T20:32:43.2471958Z self = 2025-05-07T20:32:43.2472722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2473212Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c60c0>} 2025-05-07T20:32:43.2474110Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2474296Z context = 2025-05-07T20:32:43.2474302Z 2025-05-07T20:32:43.2474462Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2474717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2474822Z module_map=module_map) 2025-05-07T20:32:43.2474982Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2475074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2475146Z E ^ 2025-05-07T20:32:43.2475495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2475500Z 2025-05-07T20:32:43.2475908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2475912Z 2025-05-07T20:32:43.2476017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2476232Z self=, 2025-05-07T20:32:43.2476305Z T=128, 2025-05-07T20:32:43.2476383Z D=5120, 2025-05-07T20:32:43.2476463Z scale_ub=1200.0, 2025-05-07T20:32:43.2476547Z contiguous=False, 2025-05-07T20:32:43.2476623Z compiled=True, 2025-05-07T20:32:43.2476692Z ) 2025-05-07T20:32:43.2476911Z self = 2025-05-07T20:32:43.2477075Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2477079Z 2025-05-07T20:32:43.2477152Z @given( 2025-05-07T20:32:43.2477268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2477363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2477478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2477594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2477782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2477860Z ) 2025-05-07T20:32:43.2478098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2478184Z def test_silu_mul_quant( 2025-05-07T20:32:43.2478259Z self, 2025-05-07T20:32:43.2478333Z T: int, 2025-05-07T20:32:43.2478405Z D: int, 2025-05-07T20:32:43.2478498Z scale_ub: Optional[float], 2025-05-07T20:32:43.2478585Z contiguous: bool, 2025-05-07T20:32:43.2478664Z compiled: bool, 2025-05-07T20:32:43.2478738Z ) -> None: 2025-05-07T20:32:43.2478828Z torch.manual_seed(2025) 2025-05-07T20:32:43.2478897Z 2025-05-07T20:32:43.2479062Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2479137Z 2025-05-07T20:32:43.2479226Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2479343Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2479431Z x = x_sign * x_clamp 2025-05-07T20:32:43.2479509Z x0 = x[:, :D] 2025-05-07T20:32:43.2479583Z x1 = x[:, D:] 2025-05-07T20:32:43.2479648Z 2025-05-07T20:32:43.2479730Z if contiguous: 2025-05-07T20:32:43.2479814Z x0 = x0.contiguous() 2025-05-07T20:32:43.2479896Z x1 = x1.contiguous() 2025-05-07T20:32:43.2479969Z 2025-05-07T20:32:43.2480053Z if scale_ub is not None: 2025-05-07T20:32:43.2480151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2480284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2480358Z ) 2025-05-07T20:32:43.2480432Z else: 2025-05-07T20:32:43.2480522Z scale_ub_tensor = None 2025-05-07T20:32:43.2480591Z 2025-05-07T20:32:43.2480718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2480886Z op = silu_mul_quant 2025-05-07T20:32:43.2480966Z if compiled: 2025-05-07T20:32:43.2481068Z op = torch.compile(op) 2025-05-07T20:32:43.2481168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2481237Z 2025-05-07T20:32:43.2481326Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2481330Z 2025-05-07T20:32:43.2481423Z moe/activation_test.py:117: 2025-05-07T20:32:43.2481546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2481644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2481737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2482103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2482192Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2482674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2482775Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2483133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2483348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2483682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2483770Z kernel = self.compile( 2025-05-07T20:32:43.2484169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2484336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2484456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2484460Z 2025-05-07T20:32:43.2484657Z self = 2025-05-07T20:32:43.2485506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2486000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c72e0>} 2025-05-07T20:32:43.2486735Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2486922Z context = 2025-05-07T20:32:43.2486926Z 2025-05-07T20:32:43.2487083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2487340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2487449Z module_map=module_map) 2025-05-07T20:32:43.2487609Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2487703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2487776Z E ^ 2025-05-07T20:32:43.2488121Z E ValueError("type fp8e4nv not supported in this architecture. 
The next six examples reach the same kernel launch and fail with the identical error, triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised from triton/compiler/compiler.py:100; the test body and traceback repeat verbatim as above:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  -> CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)     -> CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)   -> CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)    -> CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)      -> CompilationError
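To replay one of these draws deterministically while debugging, rather than re-running the whole property-based sweep, the failing inputs can be pinned with Hypothesis's example decorator; explicit examples run before any generated ones. A sketch, assuming it wraps the same test shown above:

from hypothesis import Verbosity, example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  # first failing draw in this log
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...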
The sweep then starts hitting CUDA out-of-memory errors during input setup, before the kernel is ever reached:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The following examples fail the same way; only the failing statement and the requested size differ:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> OutOfMemoryError at moe/activation_test.py:95 (torch.clamp), tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn), tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> OutOfMemoryError at moe/activation_test.py:95 (torch.clamp), tried to allocate 56.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)    -> OutOfMemoryError at moe/activation_test.py:94 (torch.sign), tried to allocate 56.00 MiB
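Note the shrinking request sizes: allocations as small as 56.00 MiB fail because memory accumulates across Hypothesis examples on the ~22 GiB device, not because any single tensor is oversized. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the message itself (which must be set before the first CUDA allocation to take effect), releasing the caching allocator between examples keeps peak usage bounded. A sketch of a helper that a teardown hook could call (the function name is illustrative):

import gc

import torch

def release_cuda_memory() -> None:
    # Drop dangling Python references first, then return cached blocks to
    # the driver so the next example starts from a near-empty allocator.
    gc.collect()
    torch.cuda.empty_cache()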
Between the out-of-memory failures, smaller examples still compile the kernel and hit the same fp8e4nv error, with the test body and traceback repeating verbatim as above:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)   -> CompilationError (type fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)   -> CompilationError (type fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)   -> CompilationError (type fp8e4nv not supported in this architecture)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2631578Z 2025-05-07T20:32:43.2631991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2631995Z 2025-05-07T20:32:43.2632093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2632311Z self=, 2025-05-07T20:32:43.2632387Z T=2048, 2025-05-07T20:32:43.2632460Z D=7168, 2025-05-07T20:32:43.2632541Z scale_ub=1200.0, 2025-05-07T20:32:43.2632621Z contiguous=True, 2025-05-07T20:32:43.2632701Z compiled=False, 2025-05-07T20:32:43.2632770Z ) 2025-05-07T20:32:43.2632980Z self = 2025-05-07T20:32:43.2633146Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2633154Z 2025-05-07T20:32:43.2633232Z @given( 2025-05-07T20:32:43.2633350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2633446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2633554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2633665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2633775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2633843Z ) 2025-05-07T20:32:43.2634080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2634275Z def test_silu_mul_quant( 2025-05-07T20:32:43.2634353Z self, 2025-05-07T20:32:43.2634432Z T: int, 2025-05-07T20:32:43.2634505Z D: int, 2025-05-07T20:32:43.2634596Z scale_ub: Optional[float], 2025-05-07T20:32:43.2634680Z contiguous: bool, 2025-05-07T20:32:43.2634759Z compiled: bool, 2025-05-07T20:32:43.2634834Z ) -> None: 2025-05-07T20:32:43.2634929Z torch.manual_seed(2025) 2025-05-07T20:32:43.2634997Z 2025-05-07T20:32:43.2635158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2636931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
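[NOTE] The recurring triton.compiler.errors.CompilationError above is an architecture limitation rather than a kernel bug: Triton's fp8e4nv type is FP8 E4M3, which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts (for example an SM 8.6 card such as the A10G), only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A minimal sketch of a capability guard that would skip these cases on unsupported GPUs (the helper name and the unittest.skipIf wiring are illustrative, not code from activation_test.py):

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) requires compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "fp8e4nv not supported on this GPU architecture")
    class ActivationTest(unittest.TestCase):
        ...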
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2636945Z 2025-05-07T20:32:43.2637057Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2637062Z 2025-05-07T20:32:43.2637164Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2637381Z self=, 2025-05-07T20:32:43.2637450Z T=1, 2025-05-07T20:32:43.2637519Z D=5120, 2025-05-07T20:32:43.2637595Z scale_ub=1200.0, 2025-05-07T20:32:43.2637671Z contiguous=True, 2025-05-07T20:32:43.2637749Z compiled=False, 2025-05-07T20:32:43.2637814Z ) 2025-05-07T20:32:43.2638029Z self = 2025-05-07T20:32:43.2638189Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2638198Z 2025-05-07T20:32:43.2638273Z @given( 2025-05-07T20:32:43.2638388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2638586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2638696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2638809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2638915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2638984Z ) 2025-05-07T20:32:43.2639218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2639306Z def test_silu_mul_quant( 2025-05-07T20:32:43.2639381Z self, 2025-05-07T20:32:43.2639452Z T: int, 2025-05-07T20:32:43.2639526Z D: int, 2025-05-07T20:32:43.2639622Z scale_ub: Optional[float], 2025-05-07T20:32:43.2639707Z contiguous: bool, 2025-05-07T20:32:43.2639784Z compiled: bool, 2025-05-07T20:32:43.2639860Z ) -> None: 2025-05-07T20:32:43.2639951Z torch.manual_seed(2025) 2025-05-07T20:32:43.2640020Z 2025-05-07T20:32:43.2640424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2640497Z 2025-05-07T20:32:43.2640584Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2640708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2640792Z x = x_sign * x_clamp 2025-05-07T20:32:43.2640869Z x0 = x[:, :D] 2025-05-07T20:32:43.2640942Z x1 = x[:, D:] 2025-05-07T20:32:43.2641011Z 2025-05-07T20:32:43.2641090Z if contiguous: 2025-05-07T20:32:43.2641178Z x0 = x0.contiguous() 2025-05-07T20:32:43.2641260Z x1 = x1.contiguous() 2025-05-07T20:32:43.2641334Z 2025-05-07T20:32:43.2641418Z if scale_ub is not None: 2025-05-07T20:32:43.2641518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2641651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2641862Z ) 2025-05-07T20:32:43.2641932Z else: 2025-05-07T20:32:43.2642024Z scale_ub_tensor = None 2025-05-07T20:32:43.2642094Z 2025-05-07T20:32:43.2642229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2642315Z op = silu_mul_quant 2025-05-07T20:32:43.2642394Z if compiled: 2025-05-07T20:32:43.2642488Z op = torch.compile(op) 2025-05-07T20:32:43.2642587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2642656Z 2025-05-07T20:32:43.2642748Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2642752Z 2025-05-07T20:32:43.2642841Z moe/activation_test.py:117: 2025-05-07T20:32:43.2642963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2643060Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2643153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2643641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2643739Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2644095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2644313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2644649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2644738Z kernel = self.compile( 2025-05-07T20:32:43.2645133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2645300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2645426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2645431Z 2025-05-07T20:32:43.2645627Z self = 2025-05-07T20:32:43.2646506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2647002Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c452bce0>} 2025-05-07T20:32:43.2647733Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2647921Z context = 2025-05-07T20:32:43.2647926Z 2025-05-07T20:32:43.2648086Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2648343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2648447Z module_map=module_map) 2025-05-07T20:32:43.2648608Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2648704Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2648774Z E ^ 2025-05-07T20:32:43.2649120Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2649124Z 2025-05-07T20:32:43.2649536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2649541Z 2025-05-07T20:32:43.2649635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2649853Z self=, 2025-05-07T20:32:43.2649925Z T=2048, 2025-05-07T20:32:43.2649992Z D=5120, 2025-05-07T20:32:43.2650068Z scale_ub=None, 2025-05-07T20:32:43.2650227Z contiguous=True, 2025-05-07T20:32:43.2650309Z compiled=False, 2025-05-07T20:32:43.2650385Z ) 2025-05-07T20:32:43.2650600Z self = 2025-05-07T20:32:43.2650767Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2650776Z 2025-05-07T20:32:43.2650847Z @given( 2025-05-07T20:32:43.2650959Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2651053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2651162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2651272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2651383Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2651451Z ) 2025-05-07T20:32:43.2651691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2651783Z def test_silu_mul_quant( 2025-05-07T20:32:43.2651860Z self, 2025-05-07T20:32:43.2651930Z T: int, 2025-05-07T20:32:43.2652002Z D: int, 2025-05-07T20:32:43.2652092Z scale_ub: Optional[float], 2025-05-07T20:32:43.2652181Z contiguous: bool, 2025-05-07T20:32:43.2652261Z compiled: bool, 2025-05-07T20:32:43.2652334Z ) -> None: 2025-05-07T20:32:43.2652425Z torch.manual_seed(2025) 2025-05-07T20:32:43.2652493Z 2025-05-07T20:32:43.2652654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2652725Z 2025-05-07T20:32:43.2652810Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2654642Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2654653Z 2025-05-07T20:32:43.2654765Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2654770Z 2025-05-07T20:32:43.2654866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2655083Z self=, 2025-05-07T20:32:43.2655153Z T=16384, 2025-05-07T20:32:43.2655223Z D=5120, 2025-05-07T20:32:43.2655298Z scale_ub=None, 2025-05-07T20:32:43.2655375Z contiguous=True, 2025-05-07T20:32:43.2655453Z compiled=False, 2025-05-07T20:32:43.2655521Z ) 2025-05-07T20:32:43.2655734Z self = 2025-05-07T20:32:43.2655907Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2655911Z 2025-05-07T20:32:43.2655986Z @given( 2025-05-07T20:32:43.2656096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2656191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2656304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2656416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2656522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2656589Z ) 2025-05-07T20:32:43.2656828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2656913Z def test_silu_mul_quant( 2025-05-07T20:32:43.2656984Z self, 2025-05-07T20:32:43.2657058Z T: int, 2025-05-07T20:32:43.2657130Z D: int, 2025-05-07T20:32:43.2657220Z scale_ub: Optional[float], 2025-05-07T20:32:43.2657309Z contiguous: bool, 2025-05-07T20:32:43.2657408Z compiled: bool, 2025-05-07T20:32:43.2657487Z ) -> None: 2025-05-07T20:32:43.2657601Z torch.manual_seed(2025) 2025-05-07T20:32:43.2657748Z 2025-05-07T20:32:43.2657910Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2659665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
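[NOTE] The OutOfMemoryError run above and below looks like a cascade rather than independent failures: each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 input on the same device, and once earlier examples have exhausted the ~22 GiB card, even the small 40-448 MiB requests fail. The error text itself names the allocator knob; a sketch of the two standard mitigations, assuming they run in the test process before CUDA is first touched (the free_cuda_memory helper is illustrative, not part of activation_test.py):

    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc

    import torch

    def free_cuda_memory() -> None:
        # Drop dangling references, then release cached, unused blocks back to
        # the driver so the next Hypothesis example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()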
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2659671Z 2025-05-07T20:32:43.2659792Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2659796Z 2025-05-07T20:32:43.2659890Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2660104Z self=, 2025-05-07T20:32:43.2660183Z T=4096, 2025-05-07T20:32:43.2660253Z D=5120, 2025-05-07T20:32:43.2660330Z scale_ub=None, 2025-05-07T20:32:43.2660416Z contiguous=True, 2025-05-07T20:32:43.2660495Z compiled=False, 2025-05-07T20:32:43.2660565Z ) 2025-05-07T20:32:43.2660780Z self = 2025-05-07T20:32:43.2660941Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2660946Z 2025-05-07T20:32:43.2661020Z @given( 2025-05-07T20:32:43.2661129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2661219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2661329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2661437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2661543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2661616Z ) 2025-05-07T20:32:43.2661857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2661944Z def test_silu_mul_quant( 2025-05-07T20:32:43.2662098Z self, 2025-05-07T20:32:43.2662168Z T: int, 2025-05-07T20:32:43.2662245Z D: int, 2025-05-07T20:32:43.2662340Z scale_ub: Optional[float], 2025-05-07T20:32:43.2662426Z contiguous: bool, 2025-05-07T20:32:43.2662509Z compiled: bool, 2025-05-07T20:32:43.2662582Z ) -> None: 2025-05-07T20:32:43.2662671Z torch.manual_seed(2025) 2025-05-07T20:32:43.2662740Z 2025-05-07T20:32:43.2662901Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2664650Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2664660Z 2025-05-07T20:32:43.2664771Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2664775Z 2025-05-07T20:32:43.2664873Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2665091Z self=, 2025-05-07T20:32:43.2665164Z T=2048, 2025-05-07T20:32:43.2665238Z D=5120, 2025-05-07T20:32:43.2665315Z scale_ub=None, 2025-05-07T20:32:43.2665399Z contiguous=False, 2025-05-07T20:32:43.2665483Z compiled=False, 2025-05-07T20:32:43.2665551Z ) 2025-05-07T20:32:43.2665761Z self = 2025-05-07T20:32:43.2665928Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2666034Z 2025-05-07T20:32:43.2670126Z @given( 2025-05-07T20:32:43.2670267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2670365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2670475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2670584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2670694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2670765Z ) 2025-05-07T20:32:43.2671011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2671100Z def test_silu_mul_quant( 2025-05-07T20:32:43.2671172Z self, 2025-05-07T20:32:43.2671247Z T: int, 2025-05-07T20:32:43.2671318Z D: int, 2025-05-07T20:32:43.2671409Z scale_ub: Optional[float], 2025-05-07T20:32:43.2671497Z contiguous: bool, 2025-05-07T20:32:43.2671576Z compiled: bool, 2025-05-07T20:32:43.2671657Z ) -> None: 2025-05-07T20:32:43.2671750Z torch.manual_seed(2025) 2025-05-07T20:32:43.2671818Z 2025-05-07T20:32:43.2671983Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2673731Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2673738Z 2025-05-07T20:32:43.2673850Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2673860Z 2025-05-07T20:32:43.2673953Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2674172Z self=, 2025-05-07T20:32:43.2674249Z T=4096, 2025-05-07T20:32:43.2674429Z D=7168, 2025-05-07T20:32:43.2674508Z scale_ub=None, 2025-05-07T20:32:43.2674589Z contiguous=True, 2025-05-07T20:32:43.2674668Z compiled=True, 2025-05-07T20:32:43.2674737Z ) 2025-05-07T20:32:43.2674948Z self = 2025-05-07T20:32:43.2675109Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2675114Z 2025-05-07T20:32:43.2675190Z @given( 2025-05-07T20:32:43.2675301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2675395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2675510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2675622Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2675730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2675808Z ) 2025-05-07T20:32:43.2676044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2676137Z def test_silu_mul_quant( 2025-05-07T20:32:43.2676216Z self, 2025-05-07T20:32:43.2676288Z T: int, 2025-05-07T20:32:43.2676357Z D: int, 2025-05-07T20:32:43.2676451Z scale_ub: Optional[float], 2025-05-07T20:32:43.2676535Z contiguous: bool, 2025-05-07T20:32:43.2676621Z compiled: bool, 2025-05-07T20:32:43.2676696Z ) -> None: 2025-05-07T20:32:43.2676785Z torch.manual_seed(2025) 2025-05-07T20:32:43.2676857Z 2025-05-07T20:32:43.2677016Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2678768Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2678858Z 2025-05-07T20:32:43.2678969Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2678974Z 2025-05-07T20:32:43.2679069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2679290Z self=, 2025-05-07T20:32:43.2679366Z T=2048, 2025-05-07T20:32:43.2679441Z D=5120, 2025-05-07T20:32:43.2679523Z scale_ub=1200.0, 2025-05-07T20:32:43.2679601Z contiguous=False, 2025-05-07T20:32:43.2679683Z compiled=False, 2025-05-07T20:32:43.2679747Z ) 2025-05-07T20:32:43.2679955Z self = 2025-05-07T20:32:43.2680137Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2680142Z 2025-05-07T20:32:43.2680216Z @given( 2025-05-07T20:32:43.2680330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2680427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2680534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2680644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2680754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2680826Z ) 2025-05-07T20:32:43.2681063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2681150Z def test_silu_mul_quant( 2025-05-07T20:32:43.2681219Z self, 2025-05-07T20:32:43.2681294Z T: int, 2025-05-07T20:32:43.2681365Z D: int, 2025-05-07T20:32:43.2681455Z scale_ub: Optional[float], 2025-05-07T20:32:43.2681540Z contiguous: bool, 2025-05-07T20:32:43.2681626Z compiled: bool, 2025-05-07T20:32:43.2681696Z ) -> None: 2025-05-07T20:32:43.2681787Z torch.manual_seed(2025) 2025-05-07T20:32:43.2681935Z 2025-05-07T20:32:43.2682097Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2683834Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2683840Z 2025-05-07T20:32:43.2683949Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2683961Z 2025-05-07T20:32:43.2684055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2684278Z self=, 2025-05-07T20:32:43.2684349Z T=4096, 2025-05-07T20:32:43.2684422Z D=7168, 2025-05-07T20:32:43.2684499Z scale_ub=1200.0, 2025-05-07T20:32:43.2684582Z contiguous=True, 2025-05-07T20:32:43.2684661Z compiled=False, 2025-05-07T20:32:43.2684731Z ) 2025-05-07T20:32:43.2684946Z self = 2025-05-07T20:32:43.2685110Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2685114Z 2025-05-07T20:32:43.2685188Z @given( 2025-05-07T20:32:43.2685298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2685388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2685499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2685608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2685798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2685872Z ) 2025-05-07T20:32:43.2686109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2686198Z def test_silu_mul_quant( 2025-05-07T20:32:43.2686273Z self, 2025-05-07T20:32:43.2686342Z T: int, 2025-05-07T20:32:43.2686413Z D: int, 2025-05-07T20:32:43.2686508Z scale_ub: Optional[float], 2025-05-07T20:32:43.2686593Z contiguous: bool, 2025-05-07T20:32:43.2686676Z compiled: bool, 2025-05-07T20:32:43.2686750Z ) -> None: 2025-05-07T20:32:43.2686838Z torch.manual_seed(2025) 2025-05-07T20:32:43.2686908Z 2025-05-07T20:32:43.2687064Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2688810Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2688824Z 2025-05-07T20:32:43.2688934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2688939Z 2025-05-07T20:32:43.2689032Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2689250Z self=, 2025-05-07T20:32:43.2689325Z T=16384, 2025-05-07T20:32:43.2689395Z D=7168, 2025-05-07T20:32:43.2689477Z scale_ub=None, 2025-05-07T20:32:43.2689558Z contiguous=False, 2025-05-07T20:32:43.2689640Z compiled=True, 2025-05-07T20:32:43.2689710Z ) 2025-05-07T20:32:43.2689923Z self = 2025-05-07T20:32:43.2690175Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2690180Z 2025-05-07T20:32:43.2690253Z @given( 2025-05-07T20:32:43.2690363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2690458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2690566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2690676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2690788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2690857Z ) 2025-05-07T20:32:43.2691095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2691181Z def test_silu_mul_quant( 2025-05-07T20:32:43.2691252Z self, 2025-05-07T20:32:43.2691327Z T: int, 2025-05-07T20:32:43.2691398Z D: int, 2025-05-07T20:32:43.2691493Z scale_ub: Optional[float], 2025-05-07T20:32:43.2691584Z contiguous: bool, 2025-05-07T20:32:43.2691668Z compiled: bool, 2025-05-07T20:32:43.2691746Z ) -> None: 2025-05-07T20:32:43.2691839Z torch.manual_seed(2025) 2025-05-07T20:32:43.2691905Z 2025-05-07T20:32:43.2692061Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2693805Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2693888Z 2025-05-07T20:32:43.2693998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2694008Z 2025-05-07T20:32:43.2694101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2694320Z self=, 2025-05-07T20:32:43.2694395Z T=4096, 2025-05-07T20:32:43.2694463Z D=7168, 2025-05-07T20:32:43.2694538Z scale_ub=None, 2025-05-07T20:32:43.2694618Z contiguous=True, 2025-05-07T20:32:43.2694697Z compiled=False, 2025-05-07T20:32:43.2694762Z ) 2025-05-07T20:32:43.2694975Z self = 2025-05-07T20:32:43.2695138Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2695142Z 2025-05-07T20:32:43.2695215Z @given( 2025-05-07T20:32:43.2695330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2695421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2695531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2695647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2695752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2695832Z ) 2025-05-07T20:32:43.2696066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2696152Z def test_silu_mul_quant( 2025-05-07T20:32:43.2696228Z self, 2025-05-07T20:32:43.2696298Z T: int, 2025-05-07T20:32:43.2696369Z D: int, 2025-05-07T20:32:43.2696462Z scale_ub: Optional[float], 2025-05-07T20:32:43.2696543Z contiguous: bool, 2025-05-07T20:32:43.2696623Z compiled: bool, 2025-05-07T20:32:43.2696694Z ) -> None: 2025-05-07T20:32:43.2696782Z torch.manual_seed(2025) 2025-05-07T20:32:43.2696852Z 2025-05-07T20:32:43.2697011Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2698879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2698893Z 2025-05-07T20:32:43.2699003Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2699008Z 2025-05-07T20:32:43.2699101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2699319Z self=, 2025-05-07T20:32:43.2699392Z T=16384, 2025-05-07T20:32:43.2699462Z D=7168, 2025-05-07T20:32:43.2699547Z scale_ub=None, 2025-05-07T20:32:43.2699631Z contiguous=True, 2025-05-07T20:32:43.2699716Z compiled=False, 2025-05-07T20:32:43.2699787Z ) 2025-05-07T20:32:43.2699993Z self = 2025-05-07T20:32:43.2700167Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2700171Z 2025-05-07T20:32:43.2700243Z @given( 2025-05-07T20:32:43.2700353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2700447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2700554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2700661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2700770Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2700837Z ) 2025-05-07T20:32:43.2701075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2701161Z def test_silu_mul_quant( 2025-05-07T20:32:43.2701232Z self, 2025-05-07T20:32:43.2701411Z T: int, 2025-05-07T20:32:43.2701485Z D: int, 2025-05-07T20:32:43.2701576Z scale_ub: Optional[float], 2025-05-07T20:32:43.2701662Z contiguous: bool, 2025-05-07T20:32:43.2701747Z compiled: bool, 2025-05-07T20:32:43.2701822Z ) -> None: 2025-05-07T20:32:43.2701912Z torch.manual_seed(2025) 2025-05-07T20:32:43.2701979Z 2025-05-07T20:32:43.2702136Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2703877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2703887Z 2025-05-07T20:32:43.2704006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2704011Z 2025-05-07T20:32:43.2704109Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2704323Z self=, 2025-05-07T20:32:43.2704398Z T=16384, 2025-05-07T20:32:43.2704470Z D=7168, 2025-05-07T20:32:43.2704548Z scale_ub=1200.0, 2025-05-07T20:32:43.2704633Z contiguous=True, 2025-05-07T20:32:43.2704711Z compiled=False, 2025-05-07T20:32:43.2704777Z ) 2025-05-07T20:32:43.2704989Z self = 2025-05-07T20:32:43.2705156Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2705160Z 2025-05-07T20:32:43.2705236Z @given( 2025-05-07T20:32:43.2705345Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2705434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2705551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2705740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2705848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2705919Z ) 2025-05-07T20:32:43.2706154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2706242Z def test_silu_mul_quant( 2025-05-07T20:32:43.2706314Z self, 2025-05-07T20:32:43.2706384Z T: int, 2025-05-07T20:32:43.2706456Z D: int, 2025-05-07T20:32:43.2706546Z scale_ub: Optional[float], 2025-05-07T20:32:43.2706629Z contiguous: bool, 2025-05-07T20:32:43.2706712Z compiled: bool, 2025-05-07T20:32:43.2706786Z ) -> None: 2025-05-07T20:32:43.2706872Z torch.manual_seed(2025) 2025-05-07T20:32:43.2706944Z 2025-05-07T20:32:43.2707101Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2708971Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2708981Z 2025-05-07T20:32:43.2709091Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2709096Z 2025-05-07T20:32:43.2709193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2709411Z self=, 2025-05-07T20:32:43.2709484Z T=128, 2025-05-07T20:32:43.2709556Z D=5120, 2025-05-07T20:32:43.2709718Z scale_ub=1200.0, 2025-05-07T20:32:43.2709798Z contiguous=False, 2025-05-07T20:32:43.2709877Z compiled=False, 2025-05-07T20:32:43.2709940Z ) 2025-05-07T20:32:43.2710155Z self = 2025-05-07T20:32:43.2710322Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2710326Z 2025-05-07T20:32:43.2710399Z @given( 2025-05-07T20:32:43.2710508Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2710605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2710712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2710819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2710930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2711000Z ) 2025-05-07T20:32:43.2711241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2711329Z def test_silu_mul_quant( 2025-05-07T20:32:43.2711405Z self, 2025-05-07T20:32:43.2711479Z T: int, 2025-05-07T20:32:43.2711551Z D: int, 2025-05-07T20:32:43.2711650Z scale_ub: Optional[float], 2025-05-07T20:32:43.2711737Z contiguous: bool, 2025-05-07T20:32:43.2711816Z compiled: bool, 2025-05-07T20:32:43.2711887Z ) -> None: 2025-05-07T20:32:43.2711979Z torch.manual_seed(2025) 2025-05-07T20:32:43.2712046Z 2025-05-07T20:32:43.2712208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2712278Z 2025-05-07T20:32:43.2712363Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2712488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2712572Z x = x_sign * x_clamp 2025-05-07T20:32:43.2712646Z x0 = x[:, :D] 2025-05-07T20:32:43.2712726Z x1 = x[:, D:] 2025-05-07T20:32:43.2712792Z 2025-05-07T20:32:43.2712869Z if contiguous: 2025-05-07T20:32:43.2712957Z x0 = x0.contiguous() 2025-05-07T20:32:43.2713041Z x1 = x1.contiguous() 2025-05-07T20:32:43.2713106Z 2025-05-07T20:32:43.2713195Z if scale_ub is not None: 2025-05-07T20:32:43.2713374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2713503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2713581Z ) 2025-05-07T20:32:43.2713652Z else: 2025-05-07T20:32:43.2713739Z scale_ub_tensor = None 2025-05-07T20:32:43.2713805Z 2025-05-07T20:32:43.2713927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2714011Z op = silu_mul_quant 2025-05-07T20:32:43.2714089Z if compiled: 2025-05-07T20:32:43.2714180Z op = torch.compile(op) 2025-05-07T20:32:43.2714286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2714351Z 2025-05-07T20:32:43.2714433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2714437Z 2025-05-07T20:32:43.2714535Z moe/activation_test.py:117: 2025-05-07T20:32:43.2714656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2714753Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2714847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2715337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2715429Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2715782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2715998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2716336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2716422Z kernel = self.compile( 2025-05-07T20:32:43.2716822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2717071Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2717194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2717199Z 2025-05-07T20:32:43.2717399Z self = 2025-05-07T20:32:43.2718159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2718649Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4483600>} 2025-05-07T20:32:43.2719378Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2719570Z context = 2025-05-07T20:32:43.2719575Z 2025-05-07T20:32:43.2719733Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2719987Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2720094Z module_map=module_map) 2025-05-07T20:32:43.2720249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2720340Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2720412Z E ^ 2025-05-07T20:32:43.2720758Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2720763Z 2025-05-07T20:32:43.2721174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2721183Z 2025-05-07T20:32:43.2721278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2721565Z self=, 2025-05-07T20:32:43.2721645Z T=2048, 2025-05-07T20:32:43.2721717Z D=7168, 2025-05-07T20:32:43.2721790Z scale_ub=None, 2025-05-07T20:32:43.2721877Z contiguous=False, 2025-05-07T20:32:43.2721953Z compiled=False, 2025-05-07T20:32:43.2722021Z ) 2025-05-07T20:32:43.2722233Z self = 2025-05-07T20:32:43.2722399Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2722404Z 2025-05-07T20:32:43.2722477Z @given( 2025-05-07T20:32:43.2722589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2722683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2722791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2722908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2723011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2723082Z ) 2025-05-07T20:32:43.2723320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2723406Z def test_silu_mul_quant( 2025-05-07T20:32:43.2723479Z self, 2025-05-07T20:32:43.2723547Z T: int, 2025-05-07T20:32:43.2723624Z D: int, 2025-05-07T20:32:43.2723715Z scale_ub: Optional[float], 2025-05-07T20:32:43.2723797Z contiguous: bool, 2025-05-07T20:32:43.2723877Z compiled: bool, 2025-05-07T20:32:43.2723949Z ) -> None: 2025-05-07T20:32:43.2724033Z torch.manual_seed(2025) 2025-05-07T20:32:43.2724100Z 2025-05-07T20:32:43.2724258Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2726020Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2726106Z 2025-05-07T20:32:43.2726217Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2726221Z 2025-05-07T20:32:43.2726314Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2726531Z self=, 2025-05-07T20:32:43.2726602Z T=128, 2025-05-07T20:32:43.2726674Z D=7168, 2025-05-07T20:32:43.2726750Z scale_ub=1200.0, 2025-05-07T20:32:43.2726823Z contiguous=True, 2025-05-07T20:32:43.2726901Z compiled=True, 2025-05-07T20:32:43.2726973Z ) 2025-05-07T20:32:43.2727180Z self = 2025-05-07T20:32:43.2727346Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2727351Z 2025-05-07T20:32:43.2727425Z @given( 2025-05-07T20:32:43.2727534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2727629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2727734Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2727849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2727954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2728019Z ) 2025-05-07T20:32:43.2728258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2728345Z def test_silu_mul_quant( 2025-05-07T20:32:43.2728412Z self, 2025-05-07T20:32:43.2728483Z T: int, 2025-05-07T20:32:43.2728560Z D: int, 2025-05-07T20:32:43.2728650Z scale_ub: Optional[float], 2025-05-07T20:32:43.2728738Z contiguous: bool, 2025-05-07T20:32:43.2728817Z compiled: bool, 2025-05-07T20:32:43.2728988Z ) -> None: 2025-05-07T20:32:43.2729078Z torch.manual_seed(2025) 2025-05-07T20:32:43.2729142Z 2025-05-07T20:32:43.2729302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2729368Z 2025-05-07T20:32:43.2729450Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2729570Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2729651Z x = x_sign * x_clamp 2025-05-07T20:32:43.2729722Z x0 = x[:, :D] 2025-05-07T20:32:43.2729795Z x1 = x[:, D:] 2025-05-07T20:32:43.2729859Z 2025-05-07T20:32:43.2729933Z if contiguous: 2025-05-07T20:32:43.2730019Z x0 = x0.contiguous() 2025-05-07T20:32:43.2730099Z x1 = x1.contiguous() 2025-05-07T20:32:43.2730164Z 2025-05-07T20:32:43.2730253Z if scale_ub is not None: 2025-05-07T20:32:43.2730352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2730491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2730560Z ) 2025-05-07T20:32:43.2730628Z else: 2025-05-07T20:32:43.2730717Z scale_ub_tensor = None 2025-05-07T20:32:43.2730785Z 2025-05-07T20:32:43.2730911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2730995Z op = silu_mul_quant 2025-05-07T20:32:43.2731072Z if compiled: 2025-05-07T20:32:43.2731164Z op = torch.compile(op) 2025-05-07T20:32:43.2731263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2731327Z 2025-05-07T20:32:43.2731411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2731416Z 2025-05-07T20:32:43.2731516Z moe/activation_test.py:117: 2025-05-07T20:32:43.2731639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2731888Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2731978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2732346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2732435Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2732917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2733007Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2733360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2733573Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2733905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2733991Z kernel = self.compile( 2025-05-07T20:32:43.2734388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2734565Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2734682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2734687Z 2025-05-07T20:32:43.2734883Z self = 2025-05-07T20:32:43.2735640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2736126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4260900>} 2025-05-07T20:32:43.2736855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2737121Z context = 2025-05-07T20:32:43.2737126Z 2025-05-07T20:32:43.2737286Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2737538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2737639Z module_map=module_map) 2025-05-07T20:32:43.2737795Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2737884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2737956Z E ^ 2025-05-07T20:32:43.2738300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2738305Z 2025-05-07T20:32:43.2738713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2738723Z 2025-05-07T20:32:43.2738827Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2739039Z self=, 2025-05-07T20:32:43.2739107Z T=128, 2025-05-07T20:32:43.2739183Z D=7168, 2025-05-07T20:32:43.2739259Z scale_ub=1200.0, 2025-05-07T20:32:43.2739337Z contiguous=True, 2025-05-07T20:32:43.2739413Z compiled=False, 2025-05-07T20:32:43.2739480Z ) 2025-05-07T20:32:43.2739691Z self = 2025-05-07T20:32:43.2739851Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2739856Z 2025-05-07T20:32:43.2739924Z @given( 2025-05-07T20:32:43.2740042Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2740366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2740614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2740728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2740840Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2740907Z ) 2025-05-07T20:32:43.2741143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2741230Z def test_silu_mul_quant( 2025-05-07T20:32:43.2741299Z self, 2025-05-07T20:32:43.2741368Z T: int, 2025-05-07T20:32:43.2741437Z D: int, 2025-05-07T20:32:43.2741531Z scale_ub: Optional[float], 2025-05-07T20:32:43.2741612Z contiguous: bool, 2025-05-07T20:32:43.2741689Z compiled: bool, 2025-05-07T20:32:43.2741764Z ) -> None: 2025-05-07T20:32:43.2741853Z torch.manual_seed(2025) 2025-05-07T20:32:43.2741915Z 2025-05-07T20:32:43.2742078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2742142Z 2025-05-07T20:32:43.2742236Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2742351Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2744099Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
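[NOTE] The compiled=True example a few entries above fails with the same fp8e4nv CompilationError: torch.compile only adds the torch/_dynamo/eval_frame.py frame to the traceback, while the Triton kernel is still JIT-compiled for the physical GPU with the same CUDAOptions, so the architecture check fails identically under compilation. A standalone repro sketch (the import path is inferred from the activation.py traceback and is an assumption, as are the shapes, which follow that T=128, D=7168, scale_ub=1200.0 example):

    import torch

    # Path inferred from .../fbgemm_gpu/experimental/gen_ai/moe/activation.py above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, scale_ub)  # raises CompilationError on pre-SM 8.9 GPUs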
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2744108Z 2025-05-07T20:32:43.2744218Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2744222Z 2025-05-07T20:32:43.2744316Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2744535Z self=, 2025-05-07T20:32:43.2744609Z T=128, 2025-05-07T20:32:43.2744677Z D=5120, 2025-05-07T20:32:43.2744753Z scale_ub=1200.0, 2025-05-07T20:32:43.2744950Z contiguous=True, 2025-05-07T20:32:43.2745030Z compiled=True, 2025-05-07T20:32:43.2745098Z ) 2025-05-07T20:32:43.2745308Z self = 2025-05-07T20:32:43.2745473Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2745477Z 2025-05-07T20:32:43.2745547Z @given( 2025-05-07T20:32:43.2745656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2745750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2745856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2745964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2746070Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2746138Z ) 2025-05-07T20:32:43.2746372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2746466Z def test_silu_mul_quant( 2025-05-07T20:32:43.2746538Z self, 2025-05-07T20:32:43.2746618Z T: int, 2025-05-07T20:32:43.2746690Z D: int, 2025-05-07T20:32:43.2746778Z scale_ub: Optional[float], 2025-05-07T20:32:43.2746868Z contiguous: bool, 2025-05-07T20:32:43.2746948Z compiled: bool, 2025-05-07T20:32:43.2747015Z ) -> None: 2025-05-07T20:32:43.2747107Z torch.manual_seed(2025) 2025-05-07T20:32:43.2747174Z 2025-05-07T20:32:43.2747356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2747482Z 2025-05-07T20:32:43.2747575Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2749319Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2749409Z 2025-05-07T20:32:43.2749519Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2749524Z 2025-05-07T20:32:43.2749624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2749838Z self=, 2025-05-07T20:32:43.2749910Z T=128, 2025-05-07T20:32:43.2749981Z D=7168, 2025-05-07T20:32:43.2750052Z scale_ub=None, 2025-05-07T20:32:43.2750127Z contiguous=True, 2025-05-07T20:32:43.2750208Z compiled=True, 2025-05-07T20:32:43.2750272Z ) 2025-05-07T20:32:43.2750479Z self = 2025-05-07T20:32:43.2750646Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2750651Z 2025-05-07T20:32:43.2750719Z @given( 2025-05-07T20:32:43.2750836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2750926Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2751031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2751146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2751251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2751320Z ) 2025-05-07T20:32:43.2751557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2751644Z def test_silu_mul_quant( 2025-05-07T20:32:43.2751713Z self, 2025-05-07T20:32:43.2751788Z T: int, 2025-05-07T20:32:43.2751856Z D: int, 2025-05-07T20:32:43.2751944Z scale_ub: Optional[float], 2025-05-07T20:32:43.2752026Z contiguous: bool, 2025-05-07T20:32:43.2752106Z compiled: bool, 2025-05-07T20:32:43.2752176Z ) -> None: 2025-05-07T20:32:43.2752263Z torch.manual_seed(2025) 2025-05-07T20:32:43.2752329Z 2025-05-07T20:32:43.2752569Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2754299Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2754305Z 2025-05-07T20:32:43.2754418Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2754551Z =============================== warnings summary =============================== 2025-05-07T20:32:43.2754856Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.2755149Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.2755436Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.2756301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:43.2756520Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:43.2756524Z 2025-05-07T20:32:43.2756695Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:43.2758014Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:43.2758192Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:43.2758197Z 2025-05-07T20:32:43.2758403Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:43.2758558Z ================== 1 failed, 1 passed, 13 warnings in 18.99s =================== 2025-05-07T20:32:45.1015068Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:45.1641137Z 2025-05-07T20:32:45.1641868Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:45.1642248Z 2025-05-07T20:32:45.1642253Z 2025-05-07T20:32:45.1662082Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:47.3190622Z ============================= test session starts ============================== 2025-05-07T20:32:47.3191258Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:47.3191791Z cachedir: .pytest_cache 2025-05-07T20:32:47.3192379Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:47.3193103Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:47.3193495Z plugins: hypothesis-6.131.14 2025-05-07T20:32:48.8795495Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:48.9758010Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:48.9758762Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:48.9758997Z 2025-05-07T20:32:50.8377792Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:50.8379439Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:50.8380802Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:50.8382268Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:50.8383272Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:50.8384554Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:50.8385912Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8387194Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:50.8389082Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8390122Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:50.8391421Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:50.8392645Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:50.8393467Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:50.8394649Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:50.8395832Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:50.8396840Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:50.8397837Z W0507 20:32:50.835000 88852 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:50.8399024Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:50.8400427Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:50.8401309Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:50.8402371Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:50.8403394Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:50.8404139Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:50.8405285Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:50.8406628Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:50.8407665Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8408555Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8409319Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:50.8410383Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.8545628Z W0507 20:32:50.853000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:50.8577673Z W0507 20:32:50.853000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2532121Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2532965Z self=, 2025-05-07T20:32:51.2533415Z T=1, 2025-05-07T20:32:51.2533604Z D=5120, 2025-05-07T20:32:51.2533797Z scale_ub=None, 2025-05-07T20:32:51.2534377Z contiguous=True, 2025-05-07T20:32:51.2534611Z compiled=True, 2025-05-07T20:32:51.2534815Z ) 2025-05-07T20:32:51.2535143Z self = 2025-05-07T20:32:51.2535641Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.2535901Z 2025-05-07T20:32:51.2535981Z @given( 2025-05-07T20:32:51.2536213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2536527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2536833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2537162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2537481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2537765Z ) 2025-05-07T20:32:51.2538135Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2538583Z def test_silu_mul_quant( 2025-05-07T20:32:51.2538831Z self, 2025-05-07T20:32:51.2539037Z T: int, 2025-05-07T20:32:51.2539251Z D: int, 2025-05-07T20:32:51.2539487Z scale_ub: Optional[float], 2025-05-07T20:32:51.2539757Z contiguous: bool, 2025-05-07T20:32:51.2539987Z compiled: bool, 2025-05-07T20:32:51.2540549Z ) -> None: 2025-05-07T20:32:51.2540767Z torch.manual_seed(2025) 2025-05-07T20:32:51.2541010Z 2025-05-07T20:32:51.2541279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2541626Z 2025-05-07T20:32:51.2541811Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2542100Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2542406Z x = x_sign * x_clamp 2025-05-07T20:32:51.2542640Z x0 = x[:, :D] 2025-05-07T20:32:51.2543043Z x1 = x[:, D:] 2025-05-07T20:32:51.2543249Z 2025-05-07T20:32:51.2543436Z if contiguous: 2025-05-07T20:32:51.2543661Z x0 = x0.contiguous() 2025-05-07T20:32:51.2543919Z x1 = x1.contiguous() 2025-05-07T20:32:51.2544164Z 2025-05-07T20:32:51.2544348Z if scale_ub is not None: 2025-05-07T20:32:51.2544616Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2544950Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2545255Z ) 2025-05-07T20:32:51.2545447Z else: 2025-05-07T20:32:51.2545655Z scale_ub_tensor = None 2025-05-07T20:32:51.2545894Z 2025-05-07T20:32:51.2546121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2546431Z op = silu_mul_quant 2025-05-07T20:32:51.2546670Z if compiled: 2025-05-07T20:32:51.2546915Z op = torch.compile(op) 2025-05-07T20:32:51.2547205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2547557Z 2025-05-07T20:32:51.2547742Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.2548023Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.2548315Z 2025-05-07T20:32:51.2548540Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2548875Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.2549161Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.2549463Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.2549818Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.2550120Z 2025-05-07T20:32:51.2550312Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:51.2550509Z 2025-05-07T20:32:51.2550607Z moe/activation_test.py:126: 2025-05-07T20:32:51.2550899Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2551226Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.2551548Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.2552452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.2553209Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.2553741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2554419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2555107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.2555826Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.2556555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.2557184Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.2557781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.2558295Z fn() 2025-05-07T20:32:51.2558812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.2559380Z self.fn.run( 2025-05-07T20:32:51.2559845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2560355Z kernel = self.compile( 2025-05-07T20:32:51.2560909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2561575Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2561967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2562188Z 2025-05-07T20:32:51.2562394Z self = 2025-05-07T20:32:51.2563556Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2564918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b445b6a0>} 2025-05-07T20:32:51.2566270Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2567316Z context = 2025-05-07T20:32:51.2567605Z 2025-05-07T20:32:51.2567767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2568284Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2568748Z module_map=module_map) 2025-05-07T20:32:51.2569104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2569459Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.2569714Z E ^ 2025-05-07T20:32:51.2570167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2570627Z 2025-05-07T20:32:51.2571047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.2571559Z 2025-05-07T20:32:51.2571659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2572070Z self=, 2025-05-07T20:32:51.2572459Z T=2048, 2025-05-07T20:32:51.2572649Z D=5120, 2025-05-07T20:32:51.2572850Z scale_ub=1200.0, 2025-05-07T20:32:51.2573064Z contiguous=True, 2025-05-07T20:32:51.2573285Z compiled=False, 2025-05-07T20:32:51.2573488Z ) 2025-05-07T20:32:51.2573890Z self = 2025-05-07T20:32:51.2574378Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.2574651Z 2025-05-07T20:32:51.2574727Z @given( 2025-05-07T20:32:51.2574953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2575254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2575554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2575878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2576192Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2576472Z ) 2025-05-07T20:32:51.2576821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2577259Z def test_silu_mul_quant( 2025-05-07T20:32:51.2577508Z self, 2025-05-07T20:32:51.2577697Z T: int, 2025-05-07T20:32:51.2577892Z D: int, 2025-05-07T20:32:51.2578110Z scale_ub: Optional[float], 2025-05-07T20:32:51.2578381Z contiguous: bool, 2025-05-07T20:32:51.2578614Z compiled: bool, 2025-05-07T20:32:51.2578826Z ) -> None: 2025-05-07T20:32:51.2579041Z torch.manual_seed(2025) 2025-05-07T20:32:51.2579283Z 2025-05-07T20:32:51.2579542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2579878Z 2025-05-07T20:32:51.2580065Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2580346Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2580654Z x = x_sign * x_clamp 2025-05-07T20:32:51.2580896Z x0 = x[:, :D] 2025-05-07T20:32:51.2581102Z x1 = x[:, D:] 2025-05-07T20:32:51.2581306Z 2025-05-07T20:32:51.2581484Z if contiguous: 2025-05-07T20:32:51.2581703Z x0 = x0.contiguous() 2025-05-07T20:32:51.2582042Z x1 = x1.contiguous() 2025-05-07T20:32:51.2582275Z 2025-05-07T20:32:51.2582452Z if scale_ub is not None: 2025-05-07T20:32:51.2582726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2583053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2583360Z ) 2025-05-07T20:32:51.2583548Z else: 2025-05-07T20:32:51.2583757Z scale_ub_tensor = None 2025-05-07T20:32:51.2584006Z 2025-05-07T20:32:51.2584229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2584535Z op = silu_mul_quant 2025-05-07T20:32:51.2584784Z if compiled: 2025-05-07T20:32:51.2585022Z op = torch.compile(op) 2025-05-07T20:32:51.2585314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2585579Z 2025-05-07T20:32:51.2585762Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.2585928Z 2025-05-07T20:32:51.2586030Z moe/activation_test.py:117: 2025-05-07T20:32:51.2586321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2586651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.2586924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2587785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.2588732Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.2589278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2589949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2590603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2591123Z kernel = self.compile( 2025-05-07T20:32:51.2591649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2592321Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2592840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2593066Z 2025-05-07T20:32:51.2593269Z self = 2025-05-07T20:32:51.2594331Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2595681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b40c1f80>} 2025-05-07T20:32:51.2597038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2598103Z context = 2025-05-07T20:32:51.2598383Z 2025-05-07T20:32:51.2598544Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2599058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2599530Z module_map=module_map) 2025-05-07T20:32:51.2599885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2600224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.2600481Z E ^ 2025-05-07T20:32:51.2600941Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.2601803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.6506766Z W0507 20:32:51.646000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:51.6538811Z W0507 20:32:51.646000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.7276616Z W0507 20:32:51.724000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:51.7308400Z W0507 20:32:51.724000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3063535Z 2025-05-07T20:32:52.3063870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3064332Z self=, 2025-05-07T20:32:52.3064737Z T=2048, 2025-05-07T20:32:52.3064981Z D=5120, 2025-05-07T20:32:52.3065246Z scale_ub=1200.0, 2025-05-07T20:32:52.3065542Z contiguous=True, 2025-05-07T20:32:52.3066071Z compiled=True, 2025-05-07T20:32:52.3066272Z ) 2025-05-07T20:32:52.3066592Z self = 2025-05-07T20:32:52.3067090Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.3067355Z 2025-05-07T20:32:52.3067509Z @given( 2025-05-07T20:32:52.3067735Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3068048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3068342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3068666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3068985Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3069264Z ) 2025-05-07T20:32:52.3069624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3070081Z def test_silu_mul_quant( 2025-05-07T20:32:52.3070321Z self, 2025-05-07T20:32:52.3070502Z T: int, 2025-05-07T20:32:52.3070703Z D: int, 2025-05-07T20:32:52.3070918Z scale_ub: Optional[float], 2025-05-07T20:32:52.3071180Z contiguous: bool, 2025-05-07T20:32:52.3071420Z compiled: bool, 2025-05-07T20:32:52.3071643Z ) -> None: 2025-05-07T20:32:52.3071847Z torch.manual_seed(2025) 2025-05-07T20:32:52.3072086Z 2025-05-07T20:32:52.3072386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3072736Z 2025-05-07T20:32:52.3072931Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3073214Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3073526Z x = x_sign * x_clamp 2025-05-07T20:32:52.3073766Z x0 = x[:, :D] 2025-05-07T20:32:52.3073974Z x1 = x[:, D:] 2025-05-07T20:32:52.3074181Z 2025-05-07T20:32:52.3074361Z if contiguous: 2025-05-07T20:32:52.3074579Z x0 = x0.contiguous() 2025-05-07T20:32:52.3074832Z x1 = x1.contiguous() 2025-05-07T20:32:52.3075070Z 2025-05-07T20:32:52.3075249Z if scale_ub is not None: 2025-05-07T20:32:52.3075524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3076012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3076316Z ) 2025-05-07T20:32:52.3076501Z else: 2025-05-07T20:32:52.3076708Z scale_ub_tensor = None 2025-05-07T20:32:52.3076955Z 2025-05-07T20:32:52.3077173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3077489Z op = silu_mul_quant 2025-05-07T20:32:52.3077734Z if compiled: 2025-05-07T20:32:52.3077973Z op = torch.compile(op) 2025-05-07T20:32:52.3078263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3078535Z 2025-05-07T20:32:52.3078717Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.3078996Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.3079280Z 2025-05-07T20:32:52.3079507Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3079837Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.3080121Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.3080436Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.3080783Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3081085Z 2025-05-07T20:32:52.3081280Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.3081469Z 2025-05-07T20:32:52.3081565Z moe/activation_test.py:126: 2025-05-07T20:32:52.3081857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3082186Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.3082509Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3083292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.3084132Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.3084670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3085340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3086026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.3086739Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.3087466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.3088087Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.3088695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.3089206Z fn() 2025-05-07T20:32:52.3089761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.3090361Z self.fn.run( 2025-05-07T20:32:52.3090841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3091360Z kernel = self.compile( 2025-05-07T20:32:52.3091892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3092561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3092945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3093165Z 2025-05-07T20:32:52.3093374Z self = 2025-05-07T20:32:52.3094500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3096074Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b41191c0>} 2025-05-07T20:32:52.3097398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3098461Z context = 2025-05-07T20:32:52.3098747Z 2025-05-07T20:32:52.3098909Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3099431Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3099898Z module_map=module_map) 2025-05-07T20:32:52.3100268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3100617Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.3100885Z E ^ 2025-05-07T20:32:52.3101348Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3101792Z 2025-05-07T20:32:52.3102216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3102723Z 2025-05-07T20:32:52.3102824Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3103231Z self=, 2025-05-07T20:32:52.3103632Z T=16384, 2025-05-07T20:32:52.3103816Z D=7168, 2025-05-07T20:32:52.3104010Z scale_ub=1200.0, 2025-05-07T20:32:52.3104232Z contiguous=False, 2025-05-07T20:32:52.3104452Z compiled=False, 2025-05-07T20:32:52.3104654Z ) 2025-05-07T20:32:52.3104965Z self = 2025-05-07T20:32:52.3105532Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3105823Z 2025-05-07T20:32:52.3105898Z @given( 2025-05-07T20:32:52.3106133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3106447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3106744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3107076Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3107475Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3107755Z ) 2025-05-07T20:32:52.3108103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3108550Z def test_silu_mul_quant( 2025-05-07T20:32:52.3108785Z self, 2025-05-07T20:32:52.3108994Z T: int, 2025-05-07T20:32:52.3109486Z D: int, 2025-05-07T20:32:52.3109784Z scale_ub: Optional[float], 2025-05-07T20:32:52.3110085Z contiguous: bool, 2025-05-07T20:32:52.3110551Z compiled: bool, 2025-05-07T20:32:52.3117233Z ) -> None: 2025-05-07T20:32:52.3117488Z torch.manual_seed(2025) 2025-05-07T20:32:52.3117743Z 2025-05-07T20:32:52.3118014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3118356Z 2025-05-07T20:32:52.3118540Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3118827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3119140Z x = x_sign * x_clamp 2025-05-07T20:32:52.3119381Z x0 = x[:, :D] 2025-05-07T20:32:52.3119618Z x1 = x[:, D:] 2025-05-07T20:32:52.3119847Z 2025-05-07T20:32:52.3120032Z if contiguous: 2025-05-07T20:32:52.3120258Z x0 = x0.contiguous() 2025-05-07T20:32:52.3120519Z x1 = x1.contiguous() 2025-05-07T20:32:52.3120763Z 2025-05-07T20:32:52.3120960Z if scale_ub is not None: 2025-05-07T20:32:52.3121228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3121567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3121882Z ) 2025-05-07T20:32:52.3122063Z else: 2025-05-07T20:32:52.3122381Z scale_ub_tensor = None 2025-05-07T20:32:52.3122633Z 2025-05-07T20:32:52.3122860Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3123166Z op = silu_mul_quant 2025-05-07T20:32:52.3123412Z if compiled: 2025-05-07T20:32:52.3123649Z op = torch.compile(op) 2025-05-07T20:32:52.3123940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3124214Z 2025-05-07T20:32:52.3124397Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3124565Z 2025-05-07T20:32:52.3124662Z moe/activation_test.py:117: 2025-05-07T20:32:52.3124954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3125277Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3125550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3126232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3126917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3127451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3128114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3128770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3129291Z kernel = self.compile( 2025-05-07T20:32:52.3129840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3130486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3130875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3131183Z 2025-05-07T20:32:52.3131393Z self = 2025-05-07T20:32:52.3132468Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3133820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b42aa980>} 2025-05-07T20:32:52.3135145Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3136206Z context = 2025-05-07T20:32:52.3136486Z 2025-05-07T20:32:52.3136654Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3137178Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3137644Z module_map=module_map) 2025-05-07T20:32:52.3138004Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3138352Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3138610Z E ^ 2025-05-07T20:32:52.3139070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:52.3139938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:52.5385628Z W0507 20:32:52.534000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:52.5417196Z W0507 20:32:52.534000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:52.5927932Z W0507 20:32:52.589000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:52.5959468Z W0507 20:32:52.589000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0441617Z 2025-05-07T20:32:53.0441885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0442307Z self=, 2025-05-07T20:32:53.0442723Z T=1, 2025-05-07T20:32:53.0442919Z D=7168, 2025-05-07T20:32:53.0443115Z scale_ub=None, 2025-05-07T20:32:53.0443328Z contiguous=True, 2025-05-07T20:32:53.0443585Z compiled=True, 2025-05-07T20:32:53.0443859Z ) 2025-05-07T20:32:53.0444206Z self = 2025-05-07T20:32:53.0444688Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.0444957Z 2025-05-07T20:32:53.0445032Z @given( 2025-05-07T20:32:53.0445254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0445769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0446077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0446418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0446733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0447019Z ) 2025-05-07T20:32:53.0447364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0447801Z def test_silu_mul_quant( 2025-05-07T20:32:53.0448047Z self, 2025-05-07T20:32:53.0448241Z T: int, 2025-05-07T20:32:53.0448437Z D: int, 2025-05-07T20:32:53.0448659Z scale_ub: Optional[float], 2025-05-07T20:32:53.0448922Z contiguous: bool, 2025-05-07T20:32:53.0449159Z compiled: bool, 2025-05-07T20:32:53.0449376Z ) -> None: 2025-05-07T20:32:53.0449605Z torch.manual_seed(2025) 2025-05-07T20:32:53.0449842Z 2025-05-07T20:32:53.0450110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0450466Z 2025-05-07T20:32:53.0450653Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0450939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0451254Z x = x_sign * x_clamp 2025-05-07T20:32:53.0451488Z x0 = x[:, :D] 2025-05-07T20:32:53.0451702Z x1 = x[:, D:] 2025-05-07T20:32:53.0451911Z 2025-05-07T20:32:53.0452091Z if contiguous: 2025-05-07T20:32:53.0452311Z x0 = x0.contiguous() 2025-05-07T20:32:53.0452568Z x1 = x1.contiguous() 2025-05-07T20:32:53.0452804Z 2025-05-07T20:32:53.0452987Z if scale_ub is not None: 2025-05-07T20:32:53.0453256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0453589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0454021Z ) 2025-05-07T20:32:53.0454209Z else: 2025-05-07T20:32:53.0454416Z scale_ub_tensor = None 2025-05-07T20:32:53.0454667Z 2025-05-07T20:32:53.0454894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0455197Z op = silu_mul_quant 2025-05-07T20:32:53.0455446Z if compiled: 2025-05-07T20:32:53.0455686Z op = torch.compile(op) 2025-05-07T20:32:53.0455984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0456262Z 2025-05-07T20:32:53.0456455Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.0456736Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.0457023Z 2025-05-07T20:32:53.0457249Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0457573Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.0457858Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.0458164Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.0458516Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.0458823Z 2025-05-07T20:32:53.0459031Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.0459228Z 2025-05-07T20:32:53.0459327Z moe/activation_test.py:126: 2025-05-07T20:32:53.0459616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0459946Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.0460266Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.0461184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.0462107Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.0462764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0463570Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0464358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.0465288Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.0466151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.0466947Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.0467714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.0468331Z fn() 2025-05-07T20:32:53.0468996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.0469654Z self.fn.run( 2025-05-07T20:32:53.0470181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0470870Z kernel = self.compile( 2025-05-07T20:32:53.0471512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0472223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0472788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0473074Z 2025-05-07T20:32:53.0473309Z self = 2025-05-07T20:32:53.0474542Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0476137Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76520>} 2025-05-07T20:32:53.0477607Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0478710Z context = 2025-05-07T20:32:53.0479120Z 2025-05-07T20:32:53.0479313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0479913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0480443Z module_map=module_map) 2025-05-07T20:32:53.0480953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0481411Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.0481709Z E ^ 2025-05-07T20:32:53.0482310Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0482885Z 2025-05-07T20:32:53.0483362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.0483902Z 2025-05-07T20:32:53.0484101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0484575Z self=, 2025-05-07T20:32:53.0485081Z T=4096, 2025-05-07T20:32:53.0485393Z D=5120, 2025-05-07T20:32:53.0485685Z scale_ub=None, 2025-05-07T20:32:53.0485975Z contiguous=False, 2025-05-07T20:32:53.0486324Z compiled=False, 2025-05-07T20:32:53.0486646Z ) 2025-05-07T20:32:53.0487019Z self = 2025-05-07T20:32:53.0487630Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.0487943Z 2025-05-07T20:32:53.0488108Z @given( 2025-05-07T20:32:53.0488432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0488876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0489302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0489796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0490216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0490624Z ) 2025-05-07T20:32:53.0491053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0491576Z def test_silu_mul_quant( 2025-05-07T20:32:53.0491935Z self, 2025-05-07T20:32:53.0492216Z T: int, 2025-05-07T20:32:53.0492537Z D: int, 2025-05-07T20:32:53.0492893Z scale_ub: Optional[float], 2025-05-07T20:32:53.0493248Z contiguous: bool, 2025-05-07T20:32:53.0493644Z compiled: bool, 2025-05-07T20:32:53.0493922Z ) -> None: 2025-05-07T20:32:53.0494220Z torch.manual_seed(2025) 2025-05-07T20:32:53.0494612Z 2025-05-07T20:32:53.0494935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0495386Z 2025-05-07T20:32:53.0495711Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0496056Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0496469Z x = x_sign * x_clamp 2025-05-07T20:32:53.0496844Z x0 = x[:, :D] 2025-05-07T20:32:53.0497191Z x1 = x[:, D:] 2025-05-07T20:32:53.0497471Z 2025-05-07T20:32:53.0497788Z if contiguous: 2025-05-07T20:32:53.0498103Z x0 = x0.contiguous() 2025-05-07T20:32:53.0498435Z x1 = x1.contiguous() 2025-05-07T20:32:53.0498807Z 2025-05-07T20:32:53.0499105Z if scale_ub is not None: 2025-05-07T20:32:53.0499429Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0499894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0500313Z ) 2025-05-07T20:32:53.0500558Z else: 2025-05-07T20:32:53.0500943Z scale_ub_tensor = None 2025-05-07T20:32:53.0501391Z 2025-05-07T20:32:53.0501671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0502121Z op = silu_mul_quant 2025-05-07T20:32:53.0502482Z if compiled: 2025-05-07T20:32:53.0502802Z op = torch.compile(op) 2025-05-07T20:32:53.0503235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0503604Z 2025-05-07T20:32:53.0503867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0504111Z 2025-05-07T20:32:53.0504272Z moe/activation_test.py:117: 2025-05-07T20:32:53.0504650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0505046Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0505545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0506299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0507063Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0507830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0508561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0509290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0509990Z kernel = self.compile( 2025-05-07T20:32:53.0510631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0511357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0511902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0512186Z 2025-05-07T20:32:53.0512418Z self = 2025-05-07T20:32:53.0513722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0515269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec77f60>} 2025-05-07T20:32:53.0516666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0517797Z context = 2025-05-07T20:32:53.0518153Z 2025-05-07T20:32:53.0518347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0518966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0519522Z module_map=module_map) 2025-05-07T20:32:53.0520043Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0520501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0520850Z E ^ 2025-05-07T20:32:53.0521426Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0521933Z 2025-05-07T20:32:53.0522427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.3339676Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.3341038Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:53.3342522Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.3344186Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.3345261Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.3346669Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.3348207Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.3349594Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.3351160Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.3352287Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] module_map=module_map) 2025-05-07T20:32:53.3353671Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.3355112Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:53.3356159Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.3357390Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.3358730Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:53.3359836Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:53.3360908Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:53.3362298Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.3363645Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.3364588Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.3365804Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:53.3366967Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:53.3367983Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:53.3369230Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.3370675Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.3371844Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.3372839Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.3373673Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:53.3374763Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5144731Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.5145999Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:53.5147386Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.5148968Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.5150352Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.5151751Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.5153235Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5154588Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.5156129Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5157307Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] module_map=module_map) 2025-05-07T20:32:53.5158665Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.5160045Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:53.5160968Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.5162393Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.5163654Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:53.5164832Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:53.5165911Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 
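The W0507 blocks tagged [0/2], [0/3], and so on are non-fatal: torch.compile tries to prove which arguments of a user-defined Triton kernel are mutated by regenerating the kernel's TTIR via generate_ttir, and when that raises (here, the same fp8e4nv CompilationError) it logs the traceback and conservatively assumes every input is mutated, which is safe but disables the optimization. A minimal sketch of that fallback shape, with hypothetical names, not the torch._higher_order_ops implementation:

from typing import Any, Callable, Dict, List

import torch

def identify_mutated_tensors_sketch(
    analyze_ir: Callable[[Dict[str, Any]], List[str]],
    kernel_kwargs: Dict[str, Any],
) -> List[str]:
    try:
        # Precise path: build the kernel IR and look for stores.
        return analyze_ir(kernel_kwargs)
    except Exception:
        # Logged path: "Encountered an exception in identify_mutated_tensors,
        # assuming every input is mutated" -- safe but pessimistic.
        return [k for k, v in kernel_kwargs.items() if isinstance(v, torch.Tensor)]

The hard failures that actually fail test_silu_mul_quant come instead from the eager launches in fn and ref_fn below, where the CompilationError propagates out of triton.runtime.jit rather than being swallowed.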
2025-05-07T20:32:53.5167150Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.5168564Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.5169590Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.5170770Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:53.5171954Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:53.5172758Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:53.5174072Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.5175549Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.5176687Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5177691Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.5178484Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:53.5179654Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.0456098Z 2025-05-07T20:32:54.0456721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0457345Z self=, 2025-05-07T20:32:54.0465011Z T=4096, 2025-05-07T20:32:54.0465244Z D=7168, 2025-05-07T20:32:54.0465435Z scale_ub=None, 2025-05-07T20:32:54.0465643Z contiguous=False, 2025-05-07T20:32:54.0465877Z compiled=False, 2025-05-07T20:32:54.0466090Z ) 2025-05-07T20:32:54.0466410Z self = 2025-05-07T20:32:54.0467000Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.0467287Z 2025-05-07T20:32:54.0467375Z @given( 2025-05-07T20:32:54.0467657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0467980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0468282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0468809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0469130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0469407Z ) 2025-05-07T20:32:54.0469749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0470196Z def test_silu_mul_quant( 2025-05-07T20:32:54.0470446Z self, 2025-05-07T20:32:54.0470629Z T: int, 2025-05-07T20:32:54.0470820Z D: int, 2025-05-07T20:32:54.0471034Z scale_ub: Optional[float], 2025-05-07T20:32:54.0471303Z contiguous: bool, 2025-05-07T20:32:54.0471537Z compiled: bool, 2025-05-07T20:32:54.0471790Z ) -> None: 2025-05-07T20:32:54.0471999Z torch.manual_seed(2025) 2025-05-07T20:32:54.0472237Z 2025-05-07T20:32:54.0472501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0472833Z 2025-05-07T20:32:54.0473017Z x_sign = torch.sign(x) 2025-05-07T20:32:54.0473298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.0473594Z x = x_sign * x_clamp 2025-05-07T20:32:54.0473834Z x0 = x[:, :D] 2025-05-07T20:32:54.0474043Z x1 = x[:, D:] 2025-05-07T20:32:54.0474238Z 2025-05-07T20:32:54.0474420Z if contiguous: 2025-05-07T20:32:54.0474643Z x0 = x0.contiguous() 2025-05-07T20:32:54.0474888Z x1 = x1.contiguous() 2025-05-07T20:32:54.0475117Z 2025-05-07T20:32:54.0475298Z if scale_ub is not None: 2025-05-07T20:32:54.0475559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.0475889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.0476182Z ) 2025-05-07T20:32:54.0476360Z else: 2025-05-07T20:32:54.0476561Z scale_ub_tensor = None 2025-05-07T20:32:54.0476806Z 2025-05-07T20:32:54.0477023Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.0477330Z op = silu_mul_quant 2025-05-07T20:32:54.0477577Z if compiled: 2025-05-07T20:32:54.0477812Z op = torch.compile(op) 2025-05-07T20:32:54.0478226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.0478495Z 2025-05-07T20:32:54.0478673Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.0478836Z 2025-05-07T20:32:54.0478932Z moe/activation_test.py:117: 2025-05-07T20:32:54.0479219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0479542Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.0479808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.0480501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.0481176Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.0481719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:54.0482582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.0483248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.0483770Z kernel = self.compile( 2025-05-07T20:32:54.0484319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.0484961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.0485358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0485579Z 2025-05-07T20:32:54.0485787Z self = 2025-05-07T20:32:54.0486857Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.0488374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76ca0>} 2025-05-07T20:32:54.0489695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.0490749Z context = 2025-05-07T20:32:54.0491028Z 2025-05-07T20:32:54.0491196Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.0491711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.0492169Z module_map=module_map) 2025-05-07T20:32:54.0492534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.0492879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.0493133Z E ^ 2025-05-07T20:32:54.0493596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.0494040Z 2025-05-07T20:32:54.0494470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.0494984Z 2025-05-07T20:32:54.0495081Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0495481Z self=, 2025-05-07T20:32:54.0495872Z T=128, 2025-05-07T20:32:54.0496045Z D=7168, 2025-05-07T20:32:54.0496235Z scale_ub=None, 2025-05-07T20:32:54.0496443Z contiguous=False, 2025-05-07T20:32:54.0496657Z compiled=True, 2025-05-07T20:32:54.0496850Z ) 2025-05-07T20:32:54.0497161Z self = 2025-05-07T20:32:54.0497650Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.0497922Z 2025-05-07T20:32:54.0497997Z @given( 2025-05-07T20:32:54.0498303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0498606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0498894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0499215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0499530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0499798Z ) 2025-05-07T20:32:54.0500137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0500572Z def test_silu_mul_quant( 2025-05-07T20:32:54.0500801Z self, 2025-05-07T20:32:54.0500986Z T: int, 2025-05-07T20:32:54.0501174Z D: int, 2025-05-07T20:32:54.0501376Z scale_ub: Optional[float], 2025-05-07T20:32:54.0501638Z contiguous: bool, 2025-05-07T20:32:54.0501873Z compiled: bool, 2025-05-07T20:32:54.0502093Z ) -> None: 2025-05-07T20:32:54.0502295Z torch.manual_seed(2025) 2025-05-07T20:32:54.0502534Z 2025-05-07T20:32:54.0502808Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0503138Z 2025-05-07T20:32:54.0503328Z x_sign = torch.sign(x) 2025-05-07T20:32:54.0503614Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.0503913Z x = x_sign * x_clamp 2025-05-07T20:32:54.0504149Z x0 = x[:, :D] 2025-05-07T20:32:54.0504364Z x1 = x[:, D:] 2025-05-07T20:32:54.0504558Z 2025-05-07T20:32:54.0504739Z if contiguous: 2025-05-07T20:32:54.0504969Z x0 = x0.contiguous() 2025-05-07T20:32:54.0505210Z x1 = x1.contiguous() 2025-05-07T20:32:54.0505443Z 2025-05-07T20:32:54.0505628Z if scale_ub is not None: 2025-05-07T20:32:54.0505888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.0506219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.0506619Z ) 2025-05-07T20:32:54.0506806Z else: 2025-05-07T20:32:54.0507006Z scale_ub_tensor = None 2025-05-07T20:32:54.0507246Z 2025-05-07T20:32:54.0507510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.0507816Z op = silu_mul_quant 2025-05-07T20:32:54.0508056Z if compiled: 2025-05-07T20:32:54.0508296Z op = torch.compile(op) 2025-05-07T20:32:54.0508584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.0508844Z 2025-05-07T20:32:54.0509024Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.0509295Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.0509573Z 2025-05-07T20:32:54.0509812Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.0510129Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.0510412Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.0510723Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.0511074Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.0511376Z 2025-05-07T20:32:54.0511571Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.0511758Z 2025-05-07T20:32:54.0511856Z moe/activation_test.py:126: 2025-05-07T20:32:54.0512139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0512468Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.0512785Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.0513579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.0514315Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.0514851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.0515527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.0516294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.0517003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.0517735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.0518361Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.0518964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.0519469Z fn() 2025-05-07T20:32:54.0519985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.0520571Z self.fn.run( 2025-05-07T20:32:54.0521038Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.0521582Z kernel = self.compile( 2025-05-07T20:32:54.0522142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.0522805Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.0523200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0523428Z 2025-05-07T20:32:54.0523634Z self = 2025-05-07T20:32:54.0524702Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.0526052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae368180>} 2025-05-07T20:32:54.0527482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.0528491Z context = 2025-05-07T20:32:54.0528773Z 2025-05-07T20:32:54.0528946Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.0529455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.0529925Z module_map=module_map) 2025-05-07T20:32:54.0530287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.0530640Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.0530894Z E ^ 2025-05-07T20:32:54.0531345Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.0531795Z 2025-05-07T20:32:54.0532233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2908150Z 2025-05-07T20:32:54.2908426Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2908850Z self=, 2025-05-07T20:32:54.2909477Z T=128, 2025-05-07T20:32:54.2909730Z D=7168, 2025-05-07T20:32:54.2910000Z scale_ub=None, 2025-05-07T20:32:54.2910303Z contiguous=False, 2025-05-07T20:32:54.2910559Z compiled=False, 2025-05-07T20:32:54.2910760Z ) 2025-05-07T20:32:54.2911075Z self = 2025-05-07T20:32:54.2911550Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2911825Z 2025-05-07T20:32:54.2911902Z @given( 2025-05-07T20:32:54.2912140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2912440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2912943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2913275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2913594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2913876Z ) 2025-05-07T20:32:54.2914222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2914653Z def test_silu_mul_quant( 2025-05-07T20:32:54.2914886Z self, 2025-05-07T20:32:54.2915074Z T: int, 2025-05-07T20:32:54.2915272Z D: int, 2025-05-07T20:32:54.2915477Z scale_ub: Optional[float], 2025-05-07T20:32:54.2915746Z contiguous: bool, 2025-05-07T20:32:54.2915979Z compiled: bool, 2025-05-07T20:32:54.2916190Z ) -> None: 2025-05-07T20:32:54.2916397Z torch.manual_seed(2025) 2025-05-07T20:32:54.2916637Z 2025-05-07T20:32:54.2916896Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2917231Z 2025-05-07T20:32:54.2917429Z x_sign = torch.sign(x) 
2025-05-07T20:32:54.2917708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2918018Z x = x_sign * x_clamp 2025-05-07T20:32:54.2918258Z x0 = x[:, :D] 2025-05-07T20:32:54.2918461Z x1 = x[:, D:] 2025-05-07T20:32:54.2918662Z 2025-05-07T20:32:54.2918838Z if contiguous: 2025-05-07T20:32:54.2919056Z x0 = x0.contiguous() 2025-05-07T20:32:54.2919304Z x1 = x1.contiguous() 2025-05-07T20:32:54.2919540Z 2025-05-07T20:32:54.2919731Z if scale_ub is not None: 2025-05-07T20:32:54.2920019Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2920374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2920673Z ) 2025-05-07T20:32:54.2920855Z else: 2025-05-07T20:32:54.2921184Z scale_ub_tensor = None 2025-05-07T20:32:54.2921431Z 2025-05-07T20:32:54.2921653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2921968Z op = silu_mul_quant 2025-05-07T20:32:54.2922213Z if compiled: 2025-05-07T20:32:54.2922449Z op = torch.compile(op) 2025-05-07T20:32:54.2922742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2923010Z 2025-05-07T20:32:54.2923190Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2923353Z 2025-05-07T20:32:54.2923448Z moe/activation_test.py:117: 2025-05-07T20:32:54.2923733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2924057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2924327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2925008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2925693Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2926223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2926894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2927568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2928096Z kernel = self.compile( 2025-05-07T20:32:54.2928645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2929291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2929682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2929904Z 2025-05-07T20:32:54.2930118Z self = 2025-05-07T20:32:54.2931259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2932618Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f37ae36b100>} 2025-05-07T20:32:54.2933933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2934986Z context = 2025-05-07T20:32:54.2935263Z 2025-05-07T20:32:54.2935422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2935940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2936399Z module_map=module_map) 2025-05-07T20:32:54.2936752Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2937098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2937350Z E ^ 2025-05-07T20:32:54.2937808Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2938270Z 2025-05-07T20:32:54.2938678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2939187Z 2025-05-07T20:32:54.2939290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2939696Z self=, 2025-05-07T20:32:54.2940351Z T=4096, 2025-05-07T20:32:54.2940530Z D=5120, 2025-05-07T20:32:54.2940715Z scale_ub=1200.0, 2025-05-07T20:32:54.2940935Z contiguous=True, 2025-05-07T20:32:54.2941271Z compiled=False, 2025-05-07T20:32:54.2941472Z ) 2025-05-07T20:32:54.2941788Z self = 2025-05-07T20:32:54.2942269Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2942536Z 2025-05-07T20:32:54.2942611Z @given( 2025-05-07T20:32:54.2942827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2943131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2943421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2943747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2944063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2944332Z ) 2025-05-07T20:32:54.2944676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2945110Z def test_silu_mul_quant( 2025-05-07T20:32:54.2945342Z self, 2025-05-07T20:32:54.2945534Z T: int, 2025-05-07T20:32:54.2945737Z D: int, 2025-05-07T20:32:54.2945942Z scale_ub: Optional[float], 2025-05-07T20:32:54.2946212Z contiguous: bool, 2025-05-07T20:32:54.2946451Z compiled: bool, 2025-05-07T20:32:54.2946665Z ) -> None: 2025-05-07T20:32:54.2946878Z torch.manual_seed(2025) 2025-05-07T20:32:54.2947115Z 2025-05-07T20:32:54.2947380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2947766Z 2025-05-07T20:32:54.2947956Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2948243Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2948544Z x = x_sign * x_clamp 2025-05-07T20:32:54.2948780Z x0 = x[:, :D] 2025-05-07T20:32:54.2948999Z x1 = x[:, D:] 2025-05-07T20:32:54.2949198Z 2025-05-07T20:32:54.2949377Z if contiguous: 2025-05-07T20:32:54.2949601Z x0 = x0.contiguous() 2025-05-07T20:32:54.2949849Z x1 = x1.contiguous() 2025-05-07T20:32:54.2950087Z 2025-05-07T20:32:54.2950268Z if scale_ub is not None: 2025-05-07T20:32:54.2950532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2951005Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2951314Z ) 2025-05-07T20:32:54.2951493Z else: 2025-05-07T20:32:54.2951701Z scale_ub_tensor = None 2025-05-07T20:32:54.2951948Z 2025-05-07T20:32:54.2952166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2952471Z op = silu_mul_quant 2025-05-07T20:32:54.2952717Z if compiled: 
2025-05-07T20:32:54.2952966Z op = torch.compile(op) 2025-05-07T20:32:54.2953254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2953523Z 2025-05-07T20:32:54.2953713Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2953872Z 2025-05-07T20:32:54.2953967Z moe/activation_test.py:117: 2025-05-07T20:32:54.2954262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2954592Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2954860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2955570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2956244Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2956774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2957434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2958103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2958625Z kernel = self.compile( 2025-05-07T20:32:54.2959161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2959888Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2960278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2960507Z 2025-05-07T20:32:54.2960715Z self = 2025-05-07T20:32:54.2961819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2963174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae1b1f80>} 2025-05-07T20:32:54.2964526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2965589Z context = 2025-05-07T20:32:54.2965868Z 2025-05-07T20:32:54.2966045Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2966555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2967071Z module_map=module_map) 2025-05-07T20:32:54.2967427Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2967768Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2968021Z E ^ 2025-05-07T20:32:54.2968473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2968930Z 2025-05-07T20:32:54.2969353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2969854Z 2025-05-07T20:32:54.2969961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2970364Z self=, 2025-05-07T20:32:54.2970769Z T=1, 2025-05-07T20:32:54.2971019Z D=5120, 2025-05-07T20:32:54.2971208Z scale_ub=None, 2025-05-07T20:32:54.2971415Z contiguous=True, 2025-05-07T20:32:54.2971633Z compiled=True, 2025-05-07T20:32:54.2971826Z ) 2025-05-07T20:32:54.2972140Z self = 2025-05-07T20:32:54.2972617Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2972872Z 2025-05-07T20:32:54.2972951Z @given( 2025-05-07T20:32:54.2973176Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2973482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2973780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2974123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2974448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2974735Z ) 2025-05-07T20:32:54.2975070Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2975523Z def test_silu_mul_quant( 2025-05-07T20:32:54.2975759Z self, 2025-05-07T20:32:54.2975944Z T: int, 2025-05-07T20:32:54.2976127Z D: int, 2025-05-07T20:32:54.2976344Z scale_ub: Optional[float], 2025-05-07T20:32:54.2976609Z contiguous: bool, 2025-05-07T20:32:54.2976835Z compiled: bool, 2025-05-07T20:32:54.2977047Z ) -> None: 2025-05-07T20:32:54.2977254Z torch.manual_seed(2025) 2025-05-07T20:32:54.2977480Z 2025-05-07T20:32:54.2977740Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2978077Z 2025-05-07T20:32:54.2978258Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2978545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2978851Z x = x_sign * x_clamp 2025-05-07T20:32:54.2979192Z x0 = x[:, :D] 2025-05-07T20:32:54.2979409Z x1 = x[:, D:] 2025-05-07T20:32:54.2979614Z 2025-05-07T20:32:54.2979795Z if contiguous: 2025-05-07T20:32:54.2980042Z x0 = x0.contiguous() 2025-05-07T20:32:54.2980325Z x1 = x1.contiguous() 2025-05-07T20:32:54.2986667Z 2025-05-07T20:32:54.2986929Z if scale_ub is not None: 2025-05-07T20:32:54.2987202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2987589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2987882Z ) 2025-05-07T20:32:54.2988065Z else: 2025-05-07T20:32:54.2988279Z scale_ub_tensor = None 2025-05-07T20:32:54.2988527Z 2025-05-07T20:32:54.2988754Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2989069Z op = silu_mul_quant 2025-05-07T20:32:54.2989319Z if compiled: 2025-05-07T20:32:54.2989564Z op = torch.compile(op) 2025-05-07T20:32:54.2989876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2990185Z 2025-05-07T20:32:54.2990375Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2990667Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2990964Z 2025-05-07T20:32:54.2991200Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2991528Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2991828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2992136Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2992481Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2992784Z 2025-05-07T20:32:54.2992987Z > y_fp8_ref, 
y_scale_ref = ref_fn()
2025-05-07T20:32:54.2993178Z 
2025-05-07T20:32:54.2993274Z moe/activation_test.py:126: 
2025-05-07T20:32:54.2993571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.2993906Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.2994230Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.2995121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.2995892Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.2996448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.2997139Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.2997835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.2998544Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.2999276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.2999903Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.3000504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.3001009Z     fn()
2025-05-07T20:32:54.3001531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.3002113Z     self.fn.run(
2025-05-07T20:32:54.3002572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.3003086Z     kernel = self.compile(
2025-05-07T20:32:54.3003615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.3004278Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.3004665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.3004973Z 
2025-05-07T20:32:54.3005186Z self = 
2025-05-07T20:32:54.3006253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.3007602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae35e520>}
2025-05-07T20:32:54.3008914Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:54.3009993Z context = 
2025-05-07T20:32:54.3010297Z 
2025-05-07T20:32:54.3010474Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.3010985Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.3011454Z                           module_map=module_map)
2025-05-07T20:32:54.3011822Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.3012170Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.3012430Z E       ^
2025-05-07T20:32:54.3012891Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.3013353Z 
2025-05-07T20:32:54.3013767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.5237690Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:54.5238750Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last):
2025-05-07T20:32:54.5240518Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:54.5241952Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:54.5242917Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
2025-05-07T20:32:54.5244193Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:54.5245590Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5246881Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:54.5248241Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.5249278Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]                        module_map=module_map)
2025-05-07T20:32:54.5250525Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:54.5251924Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     generator.visit(fn.parse())
2025-05-07T20:32:54.5252741Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:54.5253934Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:54.5255129Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ret = super().visit(node)
2025-05-07T20:32:54.5256145Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
2025-05-07T20:32:54.5257150Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     return visitor(node)
2025-05-07T20:32:54.5258351Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:54.5259625Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:54.5260572Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:54.5261647Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
2025-05-07T20:32:54.5262747Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     self.visit(item)
2025-05-07T20:32:54.5263513Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ~~~~~~~~~~^^^^^^
2025-05-07T20:32:54.5264660Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:54.5265998Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:54.5267031Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5267997Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.5268725Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^
2025-05-07T20:32:54.5269724Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[three further identify_mutated_tensors warnings elided (torch.compile retries [0/4]-[0/5], 20:32:54.580000-20:32:55.136000): each repeats the traceback above and ends in the same fp8e4nv ValueError for _fbgemm_silu_mul_quant]
2025-05-07T20:32:55.4119066Z 
2025-05-07T20:32:55.4119428Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.4120037Z     self=,
2025-05-07T20:32:55.4120649Z     T=2048,
2025-05-07T20:32:55.4120932Z     D=5120,
2025-05-07T20:32:55.4121198Z     scale_ub=None,
2025-05-07T20:32:55.4121478Z     contiguous=True,
2025-05-07T20:32:55.4121782Z     compiled=True,
2025-05-07T20:32:55.4122063Z )
2025-05-07T20:32:55.4122422Z self = 
2025-05-07T20:32:55.4122931Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:55.4123204Z 
2025-05-07T20:32:55.4123299Z     @given(
2025-05-07T20:32:55.4123529Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:55.4123847Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:55.4124152Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:55.4124489Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:55.4124809Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:55.4125121Z     )
2025-05-07T20:32:55.4125478Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:55.4126122Z     def test_silu_mul_quant(
2025-05-07T20:32:55.4126374Z         self,
2025-05-07T20:32:55.4126588Z         T: int,
2025-05-07T20:32:55.4126782Z         D: int,
2025-05-07T20:32:55.4127006Z         scale_ub: Optional[float],
2025-05-07T20:32:55.4127295Z         contiguous: bool,
2025-05-07T20:32:55.4127528Z         compiled: bool,
2025-05-07T20:32:55.4127763Z     ) -> None:
2025-05-07T20:32:55.4127996Z         torch.manual_seed(2025)
2025-05-07T20:32:55.4128238Z 
2025-05-07T20:32:55.4128527Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.4128883Z 
2025-05-07T20:32:55.4129076Z         x_sign = torch.sign(x)
2025-05-07T20:32:55.4129372Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.4129695Z         x = x_sign * x_clamp
2025-05-07T20:32:55.4129945Z         x0 = x[:, :D]
2025-05-07T20:32:55.4130181Z         x1 = x[:, D:]
2025-05-07T20:32:55.4130427Z 
2025-05-07T20:32:55.4130616Z         if contiguous:
2025-05-07T20:32:55.4130847Z             x0 = x0.contiguous()
2025-05-07T20:32:55.4131100Z             x1 = x1.contiguous()
2025-05-07T20:32:55.4131341Z 
2025-05-07T20:32:55.4131528Z         if scale_ub is not None:
2025-05-07T20:32:55.4131798Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:55.4132133Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:55.4132439Z             )
2025-05-07T20:32:55.4132637Z         else:
2025-05-07T20:32:55.4132845Z             scale_ub_tensor = None
2025-05-07T20:32:55.4133087Z 
2025-05-07T20:32:55.4133322Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.4133633Z             op = silu_mul_quant
2025-05-07T20:32:55.4133880Z             if compiled:
2025-05-07T20:32:55.4134134Z                 op = torch.compile(op)
2025-05-07T20:32:55.4134437Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.4134832Z 
2025-05-07T20:32:55.4135021Z         y_fp8, y_scale = fn()
2025-05-07T20:32:55.4135315Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:55.4135604Z 
2025-05-07T20:32:55.4135835Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.4136161Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:55.4136449Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:55.4136752Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:55.4137106Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:55.4137410Z 
2025-05-07T20:32:55.4137598Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:55.4137791Z 
2025-05-07T20:32:55.4137892Z moe/activation_test.py:126: 
2025-05-07T20:32:55.4138185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[traceback elided: identical to the first failure above; triton_quantize_fp8_row again fails to compile _kernel_quantize_fp8_row with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:55.4158211Z 
2025-05-07T20:32:55.4158628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
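The failing reference path needs neither hypothesis nor torch.compile to reproduce; a standalone sketch, assuming the same fbgemm_gpu build is importable (shapes taken from the T=2048, D=5120 example above):

import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

# Row-wise FP8 quantization of the silu(x0) * x1 reference output, as in ref_fn().
y = torch.randn(2048, 5120, device="cuda", dtype=torch.float32)
# scale_ub is optional; the failing examples pass None.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)  # raises CompilationError on sm_86

On sm_89+ the call returns the quantized rows plus one scale per row, which is exactly what the test multiplies back out as y_fp8.to(torch.float32) * y_scale[:, None].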
2025-05-07T20:32:55.4159135Z 
2025-05-07T20:32:55.4159235Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.4159639Z     self=,
2025-05-07T20:32:55.4160030Z     T=128,
2025-05-07T20:32:55.4160244Z     D=5120,
2025-05-07T20:32:55.4160458Z     scale_ub=None,
2025-05-07T20:32:55.4160658Z     contiguous=True,
2025-05-07T20:32:55.4160878Z     compiled=True,
2025-05-07T20:32:55.4161087Z )
[source listing and traceback elided: identical to the T=2048 example above except T=128; same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:32:55.4202826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
[four further identify_mutated_tensors warnings elided (torch.compile retries [0/6]-[0/7], 20:32:55.643000-20:32:56.309000): same traceback and fp8e4nv ValueError for _fbgemm_silu_mul_quant as above]
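Given that every drawn example dies in the same compile step, one containment option is to skip the FP8 cases up front on GPUs without fp8e4nv; a hypothetical unittest-style guard (the class name and wiring here are illustrative, not the actual moe/activation_test.py code):

import unittest

import torch

def _fp8e4nv_supported() -> bool:
    # Same capability probe as above: fp8e4nv requires sm_89+.
    return (
        torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (8, 9)
    )

@unittest.skipIf(
    not _fp8e4nv_supported(),
    "fp8e4nv needs sm_89+ (Ada/Hopper); this runner's A10G is sm_86",
)
class SiluMulQuantFP8Test(unittest.TestCase):  # hypothetical wrapper class
    ...

On runners like this one, that would turn the repeated CompilationError walls into a single skip line per test.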
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6200932Z 
2025-05-07T20:32:56.6201423Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:56.6202468Z     self=,
2025-05-07T20:32:56.6203339Z     T=4096,
2025-05-07T20:32:56.6203716Z     D=5120,
2025-05-07T20:32:56.6204456Z     scale_ub=None,
2025-05-07T20:32:56.6204861Z     contiguous=True,
2025-05-07T20:32:56.6205293Z     compiled=True,
2025-05-07T20:32:56.6205680Z )
2025-05-07T20:32:56.6206300Z self = 
2025-05-07T20:32:56.6207283Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:56.6207826Z 
2025-05-07T20:32:56.6207990Z     @given(
2025-05-07T20:32:56.6208443Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:56.6209042Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:56.6209644Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:56.6210297Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:56.6210654Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:56.6210938Z     )
2025-05-07T20:32:56.6211284Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:56.6211733Z     def test_silu_mul_quant(
2025-05-07T20:32:56.6211979Z         self,
2025-05-07T20:32:56.6212174Z         T: int,
2025-05-07T20:32:56.6212362Z         D: int,
2025-05-07T20:32:56.6212576Z         scale_ub: Optional[float],
2025-05-07T20:32:56.6212843Z         contiguous: bool,
2025-05-07T20:32:56.6213073Z         compiled: bool,
2025-05-07T20:32:56.6213289Z     ) -> None:
2025-05-07T20:32:56.6213503Z         torch.manual_seed(2025)
2025-05-07T20:32:56.6213753Z 
2025-05-07T20:32:56.6214012Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:56.6214351Z 
2025-05-07T20:32:56.6214542Z         x_sign = torch.sign(x)
2025-05-07T20:32:56.6214825Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:56.6215141Z         x = x_sign * x_clamp
2025-05-07T20:32:56.6215381Z         x0 = x[:, :D]
2025-05-07T20:32:56.6215584Z         x1 = x[:, D:]
2025-05-07T20:32:56.6215803Z 
2025-05-07T20:32:56.6215997Z         if contiguous:
2025-05-07T20:32:56.6216230Z             x0 = x0.contiguous()
2025-05-07T20:32:56.6216484Z             x1 = x1.contiguous()
2025-05-07T20:32:56.6216720Z 
2025-05-07T20:32:56.6216916Z         if scale_ub is not None:
2025-05-07T20:32:56.6217306Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:56.6217650Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:56.6217967Z             )
2025-05-07T20:32:56.6218163Z         else:
2025-05-07T20:32:56.6218371Z             scale_ub_tensor = None
2025-05-07T20:32:56.6218632Z 
2025-05-07T20:32:56.6218868Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.6219176Z             op = silu_mul_quant
2025-05-07T20:32:56.6219423Z             if compiled:
2025-05-07T20:32:56.6219672Z                 op = torch.compile(op)
2025-05-07T20:32:56.6219961Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:56.6220234Z 
2025-05-07T20:32:56.6220431Z         y_fp8, y_scale = fn()
2025-05-07T20:32:56.6220704Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:56.6220996Z 
2025-05-07T20:32:56.6221228Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.6221556Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:56.6221842Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:56.6222148Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:56.6222500Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.6222802Z 
2025-05-07T20:32:56.6223003Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:56.6223195Z 
2025-05-07T20:32:56.6223301Z moe/activation_test.py:126:
2025-05-07T20:32:56.6223588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.6223915Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:56.6224240Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.6225014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:56.6225856Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:56.6226404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:56.6227078Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:56.6227831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:56.6228543Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:56.6229266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:56.6229894Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:56.6230486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:56.6230999Z     fn()
2025-05-07T20:32:56.6231518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:56.6232100Z     self.fn.run(
2025-05-07T20:32:56.6232565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:56.6233082Z     kernel = self.compile(
2025-05-07T20:32:56.6233631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:56.6234266Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:56.6234659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.6234882Z 
2025-05-07T20:32:56.6235090Z self = 
2025-05-07T20:32:56.6236163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:56.6237604Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788e26de0>}
2025-05-07T20:32:56.6238919Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:56.6239975Z context = 
2025-05-07T20:32:56.6240425Z 
2025-05-07T20:32:56.6240595Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:56.6241103Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:56.6241576Z                            module_map=module_map)
2025-05-07T20:32:56.6241941Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.6242298Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:56.6242561Z E       ^
2025-05-07T20:32:56.6243023Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6243478Z 
2025-05-07T20:32:56.6243920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
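Note on the root cause: every example in this run dies at the same Triton compilation step. Both kernels involved (_kernel_quantize_fp8_row and _fbgemm_silu_mul_quant) emit fp8e4nv, Triton's name for the e4m3 format behind torch.float8_e4m3fn, and Triton only provides that type on GPUs of compute capability 8.9 and newer (Ada/Hopper). The NVIDIA A10G in this g5 runner class is capability 8.6, where only fp8e4b15 and fp8e5 exist, hence the ValueError. A minimal capability guard along these lines would skip the sweep on pre-Ada runners (a sketch, not FBGEMM code; the helper and class names are invented):

    import unittest

    import torch


    def gpu_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+; an A10G
        # reports (8, 6) and only offers fp8e4b15 / fp8e5.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(gpu_supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class Fp8ActivationTests(unittest.TestCase):
        ...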
2025-05-07T20:32:56.6244422Z 
2025-05-07T20:32:56.6244529Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:56.6261449Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:56.6261743Z moe/activation_test.py:126:
2025-05-07T20:32:56.6262359Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:56.6262683Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.6263481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:56.6264228Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:56.6280361Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.6280713Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:56.6280974Z E       ^
2025-05-07T20:32:56.6281445Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6281914Z 
2025-05-07T20:32:56.6282337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:56.6478383Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:56.6479682Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:56.6481555Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:56.6483822Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:56.6486104Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
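The warning above is a side effect of sweeping shapes and layouts through torch.compile: dynamo guards on input shapes and strides, each new (T, D, contiguous) combination fails a guard (here x0's row stride flipping between 5120 and 10240), and after recompile_limit (8) recompiles it falls back to eager for that frame. That is why later compiled=True examples still reach the raw Triton kernel. Two usual mitigations for an intentional shape sweep (a sketch, not part of the test):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Raise the cap when a test deliberately sweeps many shapes...
    torch._dynamo.config.recompile_limit = 64

    # ...or compile once with dynamic shapes so new sizes and strides
    # reuse one graph instead of accumulating guard failures:
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)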
2025-05-07T20:32:57.0742213Z 
2025-05-07T20:32:57.0742598Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.0764035Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.0764296Z moe/activation_test.py:117:
2025-05-07T20:32:57.0764897Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.0765169Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.0765733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:57.0766411Z     return fn(*args, **kwargs)
2025-05-07T20:32:57.0767063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.0767733Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.0778956Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.0779300Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.0779550Z E       ^
2025-05-07T20:32:57.0780009Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0780458Z 
2025-05-07T20:32:57.0780861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.0781370Z 
2025-05-07T20:32:57.0781468Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:57.0797968Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:57.0798250Z moe/activation_test.py:126:
2025-05-07T20:32:57.0798857Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:57.0799177Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:57.0799945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:57.0800725Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:57.0816445Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.0816794Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:57.0817051Z E       ^
2025-05-07T20:32:57.0817500Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0817948Z 
2025-05-07T20:32:57.0818375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
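The stride numbers in the recompile warning follow directly from how the test builds its inputs: x0 = x[:, :D] is a view into a [T, 2*D] tensor, so it keeps the parent's row stride 2*D, while .contiguous() repacks it to row stride D. A standalone check for D=5120:

    import torch

    x = torch.randn(4, 2 * 5120)
    x0 = x[:, :5120]
    print(x0.stride())               # (10240, 1): view into the wide parent
    print(x0.contiguous().stride())  # (5120, 1): freshly packed copy

So the contiguous=True/False sweep alternates x0 between exactly the two layouts ("expected 5120, actual 10240") that dynamo reports.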
2025-05-07T20:32:57.2261990Z 
2025-05-07T20:32:57.2262872Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:57.2286074Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.2286338Z moe/activation_test.py:117:
2025-05-07T20:32:57.2287066Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.2287344Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.2288042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.2288714Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.2299734Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.2300078Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.2300329Z E       ^
2025-05-07T20:32:57.2300782Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.2301251Z 
2025-05-07T20:32:57.2301657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
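The compiled=False examples, like the one above, show torch.compile is not the trigger: silu_mul_quant launches the Triton kernel _fbgemm_silu_mul_quant directly, so the eager path hits the same fp8e4nv error. A standalone repro (a sketch, assuming this fbgemm_gpu build and a pre-SM-8.9 GPU):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)

    # On an A10G (SM 8.6) this raises triton.compiler.errors.CompilationError
    # wrapping ValueError("type fp8e4nv not supported in this architecture. ...").
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)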
2025-05-07T20:32:57.2302167Z 
2025-05-07T20:32:57.2302266Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:57.2316218Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.2316476Z moe/activation_test.py:117:
2025-05-07T20:32:57.2317071Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.2317344Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.2317906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:57.2318447Z     return fn(*args, **kwargs)
2025-05-07T20:32:57.2319087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.2319759Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.2330854Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.2331193Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.2331485Z E       ^
2025-05-07T20:32:57.2331944Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.2332390Z 
2025-05-07T20:32:57.2332820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.2333320Z 
2025-05-07T20:32:57.2333425Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:57.3911251Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.3911519Z moe/activation_test.py:117:
2025-05-07T20:32:57.3912138Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.3912442Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.3913183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.3913978Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.3925144Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.3925491Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.3925747Z E       ^
2025-05-07T20:32:57.3926203Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3926647Z 
2025-05-07T20:32:57.3927078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.3927585Z 
2025-05-07T20:32:57.3927687Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:57.3941969Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.3942229Z moe/activation_test.py:117:
2025-05-07T20:32:57.3942831Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.3943105Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.3943778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.3944444Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.3955661Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.3956013Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.3956266Z E       ^
2025-05-07T20:32:57.3956779Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3957238Z 
2025-05-07T20:32:57.3957672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.3958179Z 
2025-05-07T20:32:57.3958280Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:57.3972182Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.3972438Z moe/activation_test.py:117:
2025-05-07T20:32:57.3973047Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.3973320Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.3974020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.3974735Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.3985754Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.3986103Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.3986362Z E       ^
2025-05-07T20:32:57.3986817Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3987278Z 
2025-05-07T20:32:57.3987755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
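For reference, the dtype names in the ValueError map onto torch dtypes roughly as follows (a sketch; the availability notes combine this log's error message with Triton's SM 8.9 requirement for fp8e4nv):

    import torch

    # Triton fp8 name -> torch dtype (None = no torch equivalent).
    TRITON_TO_TORCH_FP8 = {
        "fp8e4nv": torch.float8_e4m3fn,  # requires SM 8.9+ (Ada/Hopper)
        "fp8e5": torch.float8_e5m2,      # listed as supported on this SM 8.6 GPU
        "fp8e4b15": None,                # bias-15 e4m3 variant, Triton-internal
    }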
2025-05-07T20:32:57.5541031Z 
2025-05-07T20:32:57.5541230Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.5556398Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.5556658Z moe/activation_test.py:117:
2025-05-07T20:32:57.5557292Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.5557686Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.5558254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:57.5558811Z     return fn(*args, **kwargs)
2025-05-07T20:32:57.5559467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.5560137Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.5571282Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.5571637Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.5571888Z E       ^
2025-05-07T20:32:57.5572353Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5572818Z 
2025-05-07T20:32:57.5573246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.5572818Z 2025-05-07T20:32:57.5573246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.5573860Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True ) [identical test source and traceback elided: fn() raises the same CompilationError while compiling _fbgemm_silu_mul_quant]
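Every one of these compiles dies for the same environmental reason: Triton's fp8e4nv is the FP8 E4M3 encoding (exposed in PyTorch as torch.float8_e4m3fn), and Triton generally only emits it for NVIDIA GPUs of compute capability 8.9 or newer; older parts are offered just the fp8e4b15/fp8e5 encodings that the ValueError lists. A minimal probe, as a sketch assuming a CUDA-enabled PyTorch build (the 8.9 threshold is our reading of Triton's support matrix, not something this log prints):

    # Capability probe (sketch; assumes a CUDA build of PyTorch).
    # Triton's fp8e4nv (E4M3, torch.float8_e4m3fn) codegen is assumed to be
    # gated on compute capability >= 8.9 (Ada/Hopper); anything older only
    # gets the fp8e4b15/fp8e5 encodings named in the ValueError above.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"sm_{major}{minor}: fp8e4nv usable = {(major, minor) >= (8, 9)}")

On this runner the predicate would come out False, which is why both the torch.compile path and the eager path fail inside make_ir before any kernel ever executes.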
2025-05-07T20:32:57.5604653Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True ) [test source elided] — the one example in this stretch with a different failure shape: fn() returns, the dequantize y = y_fp8.to(torch.float32) * y_scale[:, None] completes, and the test then dies on the reference path at moe/activation_test.py:126 (> y_fp8_ref, y_scale_ref = ref_fn()). ref_fn (activation_test.py:124) calls triton_quantize_fp8_row (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370), which launches _kernel_quantize_fp8_row[grid]; the autotuner's timing pass (triton/runtime/autotuner.py:186 -> _bench:166 -> kernel_call:152, via triton/testing.py:117 do_bench) reaches jit.py:623 -> compiler.py:273, and make_ir raises the identical error at triton/compiler/compiler.py:100 — triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
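For reference, the math this test checks is small enough to restate. The following is a plain-PyTorch sketch of ref_fn's two steps — SiLU(x0) * x1 in fp32, then row-wise FP8 quantization with an optional cap on the per-row scale input; the helper name, the 448.0 E4M3 max, and the epsilon clamp are our illustrative assumptions, while the test itself uses FBGEMM's triton_quantize_fp8_row:

    import torch
    from typing import Optional, Tuple

    E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gate in fp32, exactly as ref_fn does in the test body.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Row-wise dynamic scale: per-row absmax, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / E4M3_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] recovers y to within E4M3 rounding, which is the comparison the test performs right after fn().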
[the next eight examples each re-print the same test source and fail with the identical CompilationError — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") — while compiling _fbgemm_silu_mul_quant; compiled=True runs enter via torch/_dynamo/eval_frame.py:678, compiled=False runs call the kernel directly; full tracebacks elided]
2025-05-07T20:32:57.7789766Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:57.9224549Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False )
2025-05-07T20:32:57.9256633Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:57.9287873Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:58.1197712Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False )
2025-05-07T20:32:58.1227763Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False )
2025-05-07T20:32:58.2848148Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True )
2025-05-07T20:32:58.2880681Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True )
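One further example (below) ends this stretch of the log the same way. Since every drawn case fails for the same hardware reason, the test-side remedy is to gate the whole test on FP8 support rather than letting Hypothesis replay the failure across its sample grid. A pytest-style sketch (the helper name and the sm_89 threshold are our assumptions; the suite may already have a shared skip utility for this):

    import pytest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: assume fp8e4nv needs compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(
        not cuda_supports_fp8e4nv(),
        reason="Triton fp8e4nv (torch.float8_e4m3fn) requires an sm_89+ GPU",
    )
    def test_silu_mul_quant_guarded() -> None:
        ...  # same body as test_silu_mul_quant above

Skipping at collection time also keeps the autotuner from repeatedly benchmarking kernels that can never compile on this architecture.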
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.2921531Z
2025-05-07T20:32:58.2921946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.4316680Z
Hypothesis then tried the following examples; each one reran the identical test body and failed at the same point (moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile) with the identical CompilationError, so only the sampled parameters are listed:
2025-05-07T20:32:58.4317011Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.4349097Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.4380626Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.6417463Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:58.6448408Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.7802276Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:58.7834026Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.7866365Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:59.0167883Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:59.0206516Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:59.1791081Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:59.1823499Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:59.1840832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.1841519Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.1842484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.1843171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.1843850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.1844380Z kernel = self.compile( 2025-05-07T20:32:59.1844927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.1845565Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.1845947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1846167Z 2025-05-07T20:32:59.1846369Z self = 2025-05-07T20:32:59.1847426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.1848822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f07c0>} 2025-05-07T20:32:59.1850185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.1851179Z context = 2025-05-07T20:32:59.1851459Z 2025-05-07T20:32:59.1851626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.1852136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.1852654Z module_map=module_map) 2025-05-07T20:32:59.1853021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.1853367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.1853616Z E ^ 2025-05-07T20:32:59.1854076Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.1854534Z 2025-05-07T20:32:59.1854968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.5497423Z 2025-05-07T20:32:59.5498186Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.5498810Z self=, 2025-05-07T20:32:59.5499357Z T=2048, 2025-05-07T20:32:59.5499613Z D=5120, 2025-05-07T20:32:59.5499874Z scale_ub=1200.0, 2025-05-07T20:32:59.5500181Z contiguous=False, 2025-05-07T20:32:59.5500525Z compiled=True, 2025-05-07T20:32:59.5500781Z ) 2025-05-07T20:32:59.5501120Z self = 2025-05-07T20:32:59.5501618Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.5501908Z 2025-05-07T20:32:59.5501988Z @given( 2025-05-07T20:32:59.5502220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.5502531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.5502836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.5503170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.5503501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.5503784Z ) 2025-05-07T20:32:59.5504126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.5504583Z def test_silu_mul_quant( 2025-05-07T20:32:59.5504827Z self, 2025-05-07T20:32:59.5505022Z T: int, 2025-05-07T20:32:59.5505220Z D: int, 2025-05-07T20:32:59.5505441Z scale_ub: Optional[float], 2025-05-07T20:32:59.5506110Z contiguous: bool, 2025-05-07T20:32:59.5506356Z compiled: bool, 2025-05-07T20:32:59.5506588Z ) -> None: 2025-05-07T20:32:59.5506798Z torch.manual_seed(2025) 2025-05-07T20:32:59.5507042Z 2025-05-07T20:32:59.5507314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.5507730Z 2025-05-07T20:32:59.5507924Z x_sign = torch.sign(x) 2025-05-07T20:32:59.5508217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.5508518Z x = x_sign * x_clamp 2025-05-07T20:32:59.5508759Z x0 = x[:, :D] 2025-05-07T20:32:59.5508977Z x1 = x[:, D:] 2025-05-07T20:32:59.5509175Z 2025-05-07T20:32:59.5509357Z if contiguous: 2025-05-07T20:32:59.5509587Z x0 = x0.contiguous() 2025-05-07T20:32:59.5509848Z x1 = x1.contiguous() 2025-05-07T20:32:59.5510081Z 2025-05-07T20:32:59.5510272Z if scale_ub is not None: 2025-05-07T20:32:59.5510560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.5510889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.5511283Z ) 2025-05-07T20:32:59.5511477Z else: 2025-05-07T20:32:59.5511680Z scale_ub_tensor = None 2025-05-07T20:32:59.5511934Z 2025-05-07T20:32:59.5512166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.5512468Z op = silu_mul_quant 2025-05-07T20:32:59.5512722Z if compiled: 2025-05-07T20:32:59.5512972Z op = torch.compile(op) 2025-05-07T20:32:59.5513258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5513530Z 2025-05-07T20:32:59.5513722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.5513881Z 2025-05-07T20:32:59.5513986Z moe/activation_test.py:117: 2025-05-07T20:32:59.5514363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5514691Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.5514980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5515533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.5516093Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.5516760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.5517461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.5517996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.5518669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.5519324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.5519841Z kernel = self.compile( 2025-05-07T20:32:59.5520392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.5521040Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.5521430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5521654Z 2025-05-07T20:32:59.5521856Z self = 2025-05-07T20:32:59.5522916Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.5524300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f1580>} 2025-05-07T20:32:59.5525753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.5526757Z context = 2025-05-07T20:32:59.5527047Z 2025-05-07T20:32:59.5527211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.5527728Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.5528197Z module_map=module_map) 2025-05-07T20:32:59.5528560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.5528913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.5537308Z E ^ 2025-05-07T20:32:59.5537825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.5538280Z 2025-05-07T20:32:59.5538713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.5539225Z 2025-05-07T20:32:59.5539333Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.5539832Z self=, 2025-05-07T20:32:59.5540534Z T=4096, 2025-05-07T20:32:59.5540726Z D=5120, 2025-05-07T20:32:59.5540929Z scale_ub=1200.0, 2025-05-07T20:32:59.5541163Z contiguous=True, 2025-05-07T20:32:59.5541382Z compiled=True, 2025-05-07T20:32:59.5541597Z ) 2025-05-07T20:32:59.5541921Z self = 2025-05-07T20:32:59.5542415Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:59.5542680Z 2025-05-07T20:32:59.5542761Z @given( 2025-05-07T20:32:59.5542998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.5543407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.5543710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.5544051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.5544387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.5544667Z ) 2025-05-07T20:32:59.5545015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.5545466Z def test_silu_mul_quant( 2025-05-07T20:32:59.5545715Z self, 2025-05-07T20:32:59.5545907Z T: int, 2025-05-07T20:32:59.5546109Z D: int, 2025-05-07T20:32:59.5546332Z scale_ub: Optional[float], 2025-05-07T20:32:59.5546602Z contiguous: bool, 2025-05-07T20:32:59.5546845Z compiled: bool, 2025-05-07T20:32:59.5547073Z ) -> None: 2025-05-07T20:32:59.5547285Z torch.manual_seed(2025) 2025-05-07T20:32:59.5547620Z 2025-05-07T20:32:59.5547899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.5548247Z 2025-05-07T20:32:59.5548445Z x_sign = torch.sign(x) 2025-05-07T20:32:59.5548746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.5549049Z x = x_sign * x_clamp 2025-05-07T20:32:59.5549299Z x0 = x[:, :D] 2025-05-07T20:32:59.5549520Z x1 = x[:, D:] 2025-05-07T20:32:59.5549729Z 2025-05-07T20:32:59.5549921Z if contiguous: 2025-05-07T20:32:59.5550158Z x0 = x0.contiguous() 2025-05-07T20:32:59.5550410Z x1 = x1.contiguous() 2025-05-07T20:32:59.5550658Z 2025-05-07T20:32:59.5550853Z if scale_ub is not None: 2025-05-07T20:32:59.5551122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.5551465Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.5551778Z ) 2025-05-07T20:32:59.5551974Z else: 2025-05-07T20:32:59.5552180Z scale_ub_tensor = None 2025-05-07T20:32:59.5552433Z 2025-05-07T20:32:59.5552679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.5552987Z op = silu_mul_quant 2025-05-07T20:32:59.5553242Z if compiled: 2025-05-07T20:32:59.5553628Z op = torch.compile(op) 2025-05-07T20:32:59.5553927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5554206Z 2025-05-07T20:32:59.5554404Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.5554569Z 2025-05-07T20:32:59.5554668Z moe/activation_test.py:117: 2025-05-07T20:32:59.5554968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5555304Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.5555590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5556153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.5556714Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.5557394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.5558070Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.5558625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.5559401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.5560064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.5560584Z kernel = self.compile( 2025-05-07T20:32:59.5561132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.5561783Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.5562172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5562404Z 2025-05-07T20:32:59.5562656Z self = 2025-05-07T20:32:59.5563734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.5565099Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f2840>} 2025-05-07T20:32:59.5566428Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.5567429Z context = 2025-05-07T20:32:59.5567718Z 2025-05-07T20:32:59.5567884Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.5568408Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.5568876Z module_map=module_map) 2025-05-07T20:32:59.5569233Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.5569590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.5569859Z E ^ 2025-05-07T20:32:59.5570311Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.5570763Z 2025-05-07T20:32:59.5571180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.7278459Z 2025-05-07T20:32:59.7278779Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.7279379Z self=, 2025-05-07T20:32:59.7279858Z T=128, 2025-05-07T20:32:59.7280057Z D=5120, 2025-05-07T20:32:59.7280272Z scale_ub=1200.0, 2025-05-07T20:32:59.7280506Z contiguous=False, 2025-05-07T20:32:59.7280739Z compiled=True, 2025-05-07T20:32:59.7280940Z ) 2025-05-07T20:32:59.7281603Z self = 2025-05-07T20:32:59.7282125Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.7282397Z 2025-05-07T20:32:59.7282489Z @given( 2025-05-07T20:32:59.7282718Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.7283042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.7283360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.7283690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.7284022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.7284317Z ) 2025-05-07T20:32:59.7284665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.7285124Z def test_silu_mul_quant( 2025-05-07T20:32:59.7285372Z self, 2025-05-07T20:32:59.7285568Z T: int, 2025-05-07T20:32:59.7285774Z D: int, 2025-05-07T20:32:59.7286010Z scale_ub: Optional[float], 2025-05-07T20:32:59.7286374Z contiguous: bool, 2025-05-07T20:32:59.7286624Z compiled: bool, 2025-05-07T20:32:59.7286856Z ) -> None: 2025-05-07T20:32:59.7287074Z torch.manual_seed(2025) 2025-05-07T20:32:59.7287309Z 2025-05-07T20:32:59.7287585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.7287930Z 2025-05-07T20:32:59.7288120Z x_sign = torch.sign(x) 2025-05-07T20:32:59.7288410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.7288722Z x = x_sign * x_clamp 2025-05-07T20:32:59.7288952Z x0 = x[:, :D] 2025-05-07T20:32:59.7289171Z x1 = x[:, D:] 2025-05-07T20:32:59.7289382Z 2025-05-07T20:32:59.7289564Z if contiguous: 2025-05-07T20:32:59.7289882Z x0 = x0.contiguous() 2025-05-07T20:32:59.7290139Z x1 = x1.contiguous() 2025-05-07T20:32:59.7290378Z 2025-05-07T20:32:59.7290570Z if scale_ub is not None: 2025-05-07T20:32:59.7290852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.7291184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.7291499Z ) 2025-05-07T20:32:59.7291697Z else: 2025-05-07T20:32:59.7291904Z scale_ub_tensor = None 2025-05-07T20:32:59.7292155Z 2025-05-07T20:32:59.7292389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7292699Z op = silu_mul_quant 2025-05-07T20:32:59.7292942Z if compiled: 2025-05-07T20:32:59.7293188Z op = torch.compile(op) 2025-05-07T20:32:59.7293483Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7293750Z 2025-05-07T20:32:59.7293940Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.7294102Z 2025-05-07T20:32:59.7294211Z moe/activation_test.py:117: 2025-05-07T20:32:59.7294496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7294826Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.7295107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7295668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.7296212Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.7296868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.7297548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.7298075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.7298752Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.7299423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.7299947Z kernel = self.compile( 2025-05-07T20:32:59.7300571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.7301222Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.7301617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7301839Z 2025-05-07T20:32:59.7302040Z self = 2025-05-07T20:32:59.7303104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.7304532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f34c0>} 2025-05-07T20:32:59.7305855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.7306909Z context = 2025-05-07T20:32:59.7307192Z 2025-05-07T20:32:59.7307355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.7307984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.7308447Z module_map=module_map) 2025-05-07T20:32:59.7308805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.7309145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.7309400Z E ^ 2025-05-07T20:32:59.7309853Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.7310349Z 2025-05-07T20:32:59.7310778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.7311289Z 2025-05-07T20:32:59.7311391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.7311797Z self=, 2025-05-07T20:32:59.7312190Z T=16384, 2025-05-07T20:32:59.7312379Z D=7168, 2025-05-07T20:32:59.7312571Z scale_ub=1200.0, 2025-05-07T20:32:59.7312793Z contiguous=True, 2025-05-07T20:32:59.7313005Z compiled=True, 2025-05-07T20:32:59.7313215Z ) 2025-05-07T20:32:59.7313542Z self = 2025-05-07T20:32:59.7314020Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:59.7314306Z 2025-05-07T20:32:59.7314383Z @given( 2025-05-07T20:32:59.7314613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.7314927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.7315234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.7315562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.7315904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.7316189Z ) 2025-05-07T20:32:59.7316540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.7316979Z def test_silu_mul_quant( 2025-05-07T20:32:59.7317216Z self, 2025-05-07T20:32:59.7317431Z T: int, 2025-05-07T20:32:59.7317638Z D: int, 2025-05-07T20:32:59.7317854Z scale_ub: Optional[float], 2025-05-07T20:32:59.7318132Z contiguous: bool, 2025-05-07T20:32:59.7318379Z compiled: bool, 2025-05-07T20:32:59.7318604Z ) -> None: 2025-05-07T20:32:59.7318822Z torch.manual_seed(2025) 2025-05-07T20:32:59.7319073Z 2025-05-07T20:32:59.7319346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.7319692Z 2025-05-07T20:32:59.7319893Z x_sign = torch.sign(x) 2025-05-07T20:32:59.7320311Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.7320619Z x = x_sign * x_clamp 2025-05-07T20:32:59.7320866Z x0 = x[:, :D] 2025-05-07T20:32:59.7321089Z x1 = x[:, D:] 2025-05-07T20:32:59.7321297Z 2025-05-07T20:32:59.7321492Z if contiguous: 2025-05-07T20:32:59.7321725Z x0 = x0.contiguous() 2025-05-07T20:32:59.7321979Z x1 = x1.contiguous() 2025-05-07T20:32:59.7322220Z 2025-05-07T20:32:59.7322417Z if scale_ub is not None: 2025-05-07T20:32:59.7322683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.7323022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.7323331Z ) 2025-05-07T20:32:59.7323515Z else: 2025-05-07T20:32:59.7323733Z scale_ub_tensor = None 2025-05-07T20:32:59.7323987Z 2025-05-07T20:32:59.7324214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7324538Z op = silu_mul_quant 2025-05-07T20:32:59.7324795Z if compiled: 2025-05-07T20:32:59.7325101Z op = torch.compile(op) 2025-05-07T20:32:59.7325392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7325672Z 2025-05-07T20:32:59.7325869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.7326033Z 2025-05-07T20:32:59.7326135Z moe/activation_test.py:117: 2025-05-07T20:32:59.7326437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7326771Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.7327050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7327616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.7328187Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.7328898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.7329604Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.7330150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.7330834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.7331489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.7332022Z kernel = self.compile( 2025-05-07T20:32:59.7332572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.7333229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.7333629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7333868Z 2025-05-07T20:32:59.7334078Z self = 2025-05-07T20:32:59.7335153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.7336566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fcc20>} 2025-05-07T20:32:59.7337880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.7338898Z context = 2025-05-07T20:32:59.7339193Z 2025-05-07T20:32:59.7339365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.7339973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.7340692Z module_map=module_map) 2025-05-07T20:32:59.7341065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.7341427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.7341697Z E ^ 2025-05-07T20:32:59.7342158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.7342614Z 2025-05-07T20:32:59.7343053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8511897Z 2025-05-07T20:32:59.8512123Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8512540Z self=, 2025-05-07T20:32:59.8513013Z T=16384, 2025-05-07T20:32:59.8513288Z D=5120, 2025-05-07T20:32:59.8513557Z scale_ub=1200.0, 2025-05-07T20:32:59.8513859Z contiguous=True, 2025-05-07T20:32:59.8514170Z compiled=False, 2025-05-07T20:32:59.8514454Z ) 2025-05-07T20:32:59.8514992Z self = 2025-05-07T20:32:59.8515496Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:59.8515769Z 2025-05-07T20:32:59.8515859Z @given( 2025-05-07T20:32:59.8516086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.8516408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.8516721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.8517056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.8517379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.8517669Z ) 2025-05-07T20:32:59.8518021Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.8518544Z def test_silu_mul_quant( 2025-05-07T20:32:59.8518796Z self, 2025-05-07T20:32:59.8518998Z T: int, 2025-05-07T20:32:59.8519200Z D: int, 2025-05-07T20:32:59.8519428Z scale_ub: Optional[float], 2025-05-07T20:32:59.8519704Z contiguous: bool, 2025-05-07T20:32:59.8519944Z compiled: bool, 2025-05-07T20:32:59.8520184Z ) -> None: 2025-05-07T20:32:59.8520409Z torch.manual_seed(2025) 2025-05-07T20:32:59.8520650Z 2025-05-07T20:32:59.8520928Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.8521271Z 2025-05-07T20:32:59.8521464Z x_sign = torch.sign(x) 2025-05-07T20:32:59.8521764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.8522077Z x = x_sign * x_clamp 2025-05-07T20:32:59.8522326Z x0 = x[:, :D] 2025-05-07T20:32:59.8522542Z x1 = x[:, D:] 2025-05-07T20:32:59.8522763Z 2025-05-07T20:32:59.8522959Z if contiguous: 2025-05-07T20:32:59.8523192Z x0 = x0.contiguous() 2025-05-07T20:32:59.8523459Z x1 = x1.contiguous() 2025-05-07T20:32:59.8523710Z 2025-05-07T20:32:59.8523901Z if scale_ub is not None: 2025-05-07T20:32:59.8524184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.8524527Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.8524829Z ) 2025-05-07T20:32:59.8525032Z else: 2025-05-07T20:32:59.8525251Z scale_ub_tensor = None 2025-05-07T20:32:59.8525504Z 2025-05-07T20:32:59.8525744Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.8526062Z op = silu_mul_quant 2025-05-07T20:32:59.8526310Z if compiled: 2025-05-07T20:32:59.8526565Z op = torch.compile(op) 2025-05-07T20:32:59.8526864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8527142Z 2025-05-07T20:32:59.8527338Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.8527513Z 2025-05-07T20:32:59.8527616Z moe/activation_test.py:117: 2025-05-07T20:32:59.8528077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8528419Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.8528717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8529412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:59.8530094Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.8530642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.8531331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.8531994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.8532518Z kernel = self.compile( 2025-05-07T20:32:59.8533082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.8533747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.8534191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8534416Z 2025-05-07T20:32:59.8534625Z self = 2025-05-07T20:32:59.8535713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.8537078Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fd580>} 2025-05-07T20:32:59.8538413Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.8539461Z context = 2025-05-07T20:32:59.8539758Z 2025-05-07T20:32:59.8539928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.8540715Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.8541189Z module_map=module_map) 2025-05-07T20:32:59.8541550Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.8541907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.8542179Z E ^ 2025-05-07T20:32:59.8542676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.8543120Z 2025-05-07T20:32:59.8543546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8544058Z 2025-05-07T20:32:59.8544170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8544592Z self=, 2025-05-07T20:32:59.8545004Z T=1, 2025-05-07T20:32:59.8545191Z D=7168, 2025-05-07T20:32:59.8545394Z scale_ub=1200.0, 2025-05-07T20:32:59.8545628Z contiguous=False, 2025-05-07T20:32:59.8545853Z compiled=False, 2025-05-07T20:32:59.8546067Z ) 2025-05-07T20:32:59.8546388Z self = 2025-05-07T20:32:59.8546870Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:59.8547151Z 2025-05-07T20:32:59.8547233Z @given( 2025-05-07T20:32:59.8547579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.8547903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.8548212Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.8548547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.8549010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.8549291Z ) 2025-05-07T20:32:59.8549645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.8550090Z def test_silu_mul_quant( 2025-05-07T20:32:59.8550330Z self, 2025-05-07T20:32:59.8550529Z T: int, 2025-05-07T20:32:59.8550733Z D: int, 2025-05-07T20:32:59.8550949Z scale_ub: Optional[float], 2025-05-07T20:32:59.8551229Z contiguous: bool, 2025-05-07T20:32:59.8551472Z compiled: bool, 2025-05-07T20:32:59.8551692Z ) -> None: 2025-05-07T20:32:59.8551917Z torch.manual_seed(2025) 2025-05-07T20:32:59.8552163Z 2025-05-07T20:32:59.8552441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.8552786Z 2025-05-07T20:32:59.8552988Z x_sign = torch.sign(x) 2025-05-07T20:32:59.8553284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.8553597Z x = x_sign * x_clamp 2025-05-07T20:32:59.8553838Z x0 = x[:, :D] 2025-05-07T20:32:59.8554132Z x1 = x[:, D:] 2025-05-07T20:32:59.8554338Z 2025-05-07T20:32:59.8554532Z if contiguous: 2025-05-07T20:32:59.8554769Z x0 = x0.contiguous() 2025-05-07T20:32:59.8555021Z x1 = x1.contiguous() 2025-05-07T20:32:59.8555266Z 2025-05-07T20:32:59.8555467Z if scale_ub is not None: 2025-05-07T20:32:59.8555742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.8556086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.8564345Z ) 2025-05-07T20:32:59.8564565Z else: 2025-05-07T20:32:59.8564781Z scale_ub_tensor = None 2025-05-07T20:32:59.8565034Z 2025-05-07T20:32:59.8565263Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.8565700Z op = silu_mul_quant 2025-05-07T20:32:59.8565949Z if compiled: 2025-05-07T20:32:59.8566202Z op = torch.compile(op) 2025-05-07T20:32:59.8566494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8566777Z 2025-05-07T20:32:59.8566974Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.8567139Z 2025-05-07T20:32:59.8567238Z moe/activation_test.py:117: 2025-05-07T20:32:59.8567531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8567860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.8568132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8568822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.8569500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.8570033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.8570723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.8571384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.8571910Z kernel = self.compile( 2025-05-07T20:32:59.8572456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.8573095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.8573492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8573718Z 2025-05-07T20:32:59.8573934Z self = 2025-05-07T20:32:59.8574990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.8576435Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fe8e0>} 2025-05-07T20:32:59.8577759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.8578772Z context = 2025-05-07T20:32:59.8579051Z 2025-05-07T20:32:59.8579221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.8579731Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.8580217Z module_map=module_map) 2025-05-07T20:32:59.8580575Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.8580919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.8581181Z E ^ 2025-05-07T20:32:59.8581644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.8582141Z 2025-05-07T20:32:59.8582557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8583056Z 2025-05-07T20:32:59.8583157Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8583562Z self=, 2025-05-07T20:32:59.8583958Z T=4096, 2025-05-07T20:32:59.8584147Z D=7168, 2025-05-07T20:32:59.8584334Z scale_ub=1200.0, 2025-05-07T20:32:59.8584561Z contiguous=False, 2025-05-07T20:32:59.8584793Z compiled=True, 2025-05-07T20:33:00.0203768Z ) 2025-05-07T20:33:00.0204710Z self = 2025-05-07T20:33:00.0206063Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.0206617Z 2025-05-07T20:33:00.0206790Z @given( 2025-05-07T20:33:00.0207240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0207854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0208454Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0209091Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0209729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0210288Z ) 2025-05-07T20:33:00.0210963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0211509Z def test_silu_mul_quant( 2025-05-07T20:33:00.0211753Z self, 2025-05-07T20:33:00.0211941Z T: int, 2025-05-07T20:33:00.0212139Z D: int, 2025-05-07T20:33:00.0212362Z scale_ub: Optional[float], 2025-05-07T20:33:00.0212627Z contiguous: bool, 2025-05-07T20:33:00.0212871Z compiled: bool, 2025-05-07T20:33:00.0213104Z ) -> None: 2025-05-07T20:33:00.0213314Z torch.manual_seed(2025) 2025-05-07T20:33:00.0213560Z 2025-05-07T20:33:00.0213832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0214173Z 2025-05-07T20:33:00.0214364Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0214651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0214962Z x = x_sign * x_clamp 2025-05-07T20:33:00.0215199Z x0 = x[:, :D] 2025-05-07T20:33:00.0215423Z x1 = x[:, D:] 2025-05-07T20:33:00.0215640Z 2025-05-07T20:33:00.0215820Z if contiguous: 2025-05-07T20:33:00.0216049Z x0 = x0.contiguous() 2025-05-07T20:33:00.0216304Z x1 = x1.contiguous() 2025-05-07T20:33:00.0216542Z 2025-05-07T20:33:00.0216732Z if scale_ub is not None: 2025-05-07T20:33:00.0217009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0217339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0217647Z ) 2025-05-07T20:33:00.0217842Z else: 2025-05-07T20:33:00.0218223Z scale_ub_tensor = None 2025-05-07T20:33:00.0218482Z 2025-05-07T20:33:00.0218714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0219019Z op = silu_mul_quant 2025-05-07T20:33:00.0219273Z if compiled: 2025-05-07T20:33:00.0219524Z op = torch.compile(op) 2025-05-07T20:33:00.0219820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0220089Z 2025-05-07T20:33:00.0220283Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.0220444Z 2025-05-07T20:33:00.0220550Z moe/activation_test.py:117: 2025-05-07T20:33:00.0220838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0221169Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.0221454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0222008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.0222570Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.0223304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.0223982Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.0224510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0225191Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0225848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0226368Z kernel = self.compile( 2025-05-07T20:33:00.0226924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0227732Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0228127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0228352Z 2025-05-07T20:33:00.0228555Z self = 2025-05-07T20:33:00.0229620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0230989Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76ffa60>} 2025-05-07T20:33:00.0232365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0233373Z context = 2025-05-07T20:33:00.0233658Z 2025-05-07T20:33:00.0233822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0234352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0234815Z module_map=module_map) 2025-05-07T20:33:00.0235175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0235526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.0235786Z E ^ 2025-05-07T20:33:00.0236247Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0236697Z 2025-05-07T20:33:00.0237120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0237633Z 2025-05-07T20:33:00.0237736Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0238223Z self=, 2025-05-07T20:33:00.0238629Z T=128, 2025-05-07T20:33:00.0238816Z D=7168, 2025-05-07T20:33:00.0239037Z scale_ub=1200.0, 2025-05-07T20:33:00.0239269Z contiguous=False, 2025-05-07T20:33:00.0239502Z compiled=True, 2025-05-07T20:33:00.0239706Z ) 2025-05-07T20:33:00.0240022Z self = 2025-05-07T20:33:00.0240787Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.0241055Z 2025-05-07T20:33:00.0241135Z @given( 2025-05-07T20:33:00.0241371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0241684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0241984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0242311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0242643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0242926Z ) 2025-05-07T20:33:00.0243277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0243789Z def test_silu_mul_quant( 2025-05-07T20:33:00.0244029Z self, 2025-05-07T20:33:00.0244223Z T: int, 2025-05-07T20:33:00.0244423Z D: int, 2025-05-07T20:33:00.0244647Z scale_ub: Optional[float], 2025-05-07T20:33:00.0244913Z contiguous: bool, 2025-05-07T20:33:00.0245153Z compiled: bool, 2025-05-07T20:33:00.0245377Z ) -> None: 2025-05-07T20:33:00.0245588Z torch.manual_seed(2025) 2025-05-07T20:33:00.0245832Z 2025-05-07T20:33:00.0246101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0246429Z 2025-05-07T20:33:00.0246618Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0246908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0247284Z x = x_sign * x_clamp 2025-05-07T20:33:00.0247527Z x0 = x[:, :D] 2025-05-07T20:33:00.0247749Z x1 = x[:, D:] 2025-05-07T20:33:00.0247959Z 2025-05-07T20:33:00.0248152Z if contiguous: 2025-05-07T20:33:00.0248388Z x0 = x0.contiguous() 2025-05-07T20:33:00.0248649Z x1 = x1.contiguous() 2025-05-07T20:33:00.0248884Z 2025-05-07T20:33:00.0249074Z if scale_ub is not None: 2025-05-07T20:33:00.0249343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0249674Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0249986Z ) 2025-05-07T20:33:00.0250180Z else: 2025-05-07T20:33:00.0250387Z scale_ub_tensor = None 2025-05-07T20:33:00.0250639Z 2025-05-07T20:33:00.0250874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0251185Z op = silu_mul_quant 2025-05-07T20:33:00.0251450Z if compiled: 2025-05-07T20:33:00.0251709Z op = torch.compile(op) 2025-05-07T20:33:00.0252000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0252280Z 2025-05-07T20:33:00.0252481Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.0252668Z 2025-05-07T20:33:00.0252767Z moe/activation_test.py:117: 2025-05-07T20:33:00.0253065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0253388Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.0253673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0254236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.0254792Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.0255439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.0256124Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.0256671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0257510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0258185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0258759Z kernel = self.compile( 2025-05-07T20:33:00.0259319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0259974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0260384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0260611Z 2025-05-07T20:33:00.0260837Z self = 2025-05-07T20:33:00.0261907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0263261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d4ea0>} 2025-05-07T20:33:00.0264677Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0265689Z context = 2025-05-07T20:33:00.0265970Z 2025-05-07T20:33:00.0266148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0266662Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0267131Z module_map=module_map) 2025-05-07T20:33:00.0267634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0267990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.0268252Z E ^ 2025-05-07T20:33:00.0268723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0269174Z 2025-05-07T20:33:00.0269595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0270109Z 2025-05-07T20:33:00.0270219Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0270621Z self=, 2025-05-07T20:33:00.0271021Z T=2048, 2025-05-07T20:33:00.0271212Z D=7168, 2025-05-07T20:33:00.0271404Z scale_ub=None, 2025-05-07T20:33:00.0271627Z contiguous=True, 2025-05-07T20:33:00.0271851Z compiled=True, 2025-05-07T20:33:00.1498503Z ) 2025-05-07T20:33:00.1499010Z self = 2025-05-07T20:33:00.1499683Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.1500067Z 2025-05-07T20:33:00.1500170Z @given( 2025-05-07T20:33:00.1500400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1500712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1501025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1501348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1501674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1501967Z ) 2025-05-07T20:33:00.1502313Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1502757Z def test_silu_mul_quant( 2025-05-07T20:33:00.1503011Z self, 2025-05-07T20:33:00.1503209Z T: int, 2025-05-07T20:33:00.1503405Z D: int, 2025-05-07T20:33:00.1503629Z scale_ub: Optional[float], 2025-05-07T20:33:00.1503905Z contiguous: bool, 2025-05-07T20:33:00.1504147Z compiled: bool, 2025-05-07T20:33:00.1504382Z ) -> None: 2025-05-07T20:33:00.1504918Z torch.manual_seed(2025) 2025-05-07T20:33:00.1505164Z 2025-05-07T20:33:00.1505437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1505781Z 2025-05-07T20:33:00.1505970Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1506264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1506578Z x = x_sign * x_clamp 2025-05-07T20:33:00.1506815Z x0 = x[:, :D] 2025-05-07T20:33:00.1507041Z x1 = x[:, D:] 2025-05-07T20:33:00.1507255Z 2025-05-07T20:33:00.1507568Z if contiguous: 2025-05-07T20:33:00.1507808Z x0 = x0.contiguous() 2025-05-07T20:33:00.1508057Z x1 = x1.contiguous() 2025-05-07T20:33:00.1508298Z 2025-05-07T20:33:00.1508492Z if scale_ub is not None: 2025-05-07T20:33:00.1508763Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.1509099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.1509414Z ) 2025-05-07T20:33:00.1509603Z else: 2025-05-07T20:33:00.1509896Z scale_ub_tensor = None 2025-05-07T20:33:00.1510150Z 2025-05-07T20:33:00.1510373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.1510685Z op = silu_mul_quant 2025-05-07T20:33:00.1510938Z if compiled: 2025-05-07T20:33:00.1511186Z op = torch.compile(op) 2025-05-07T20:33:00.1511480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1511751Z 2025-05-07T20:33:00.1511946Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.1512108Z 2025-05-07T20:33:00.1512211Z moe/activation_test.py:117: 2025-05-07T20:33:00.1512503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1512833Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.1513194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1513759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.1514321Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.1515021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.1515695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.1516232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.1516904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.1517558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.1518091Z kernel = self.compile( 2025-05-07T20:33:00.1518648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.1519319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.1519722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1519959Z 2025-05-07T20:33:00.1520166Z self = 2025-05-07T20:33:00.1521264Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.1522635Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d5c60>} 2025-05-07T20:33:00.1523951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.1525052Z context = 2025-05-07T20:33:00.1525342Z 2025-05-07T20:33:00.1525505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.1526029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.1526488Z module_map=module_map) 2025-05-07T20:33:00.1526862Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.1527216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.1527483Z E ^ 2025-05-07T20:33:00.1527937Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.1528390Z 2025-05-07T20:33:00.1528811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.1529315Z 2025-05-07T20:33:00.1529422Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1529833Z self=, 2025-05-07T20:33:00.1530282Z T=16384, 2025-05-07T20:33:00.1530475Z D=5120, 2025-05-07T20:33:00.1530670Z scale_ub=None, 2025-05-07T20:33:00.1530878Z contiguous=False, 2025-05-07T20:33:00.1531100Z compiled=False, 2025-05-07T20:33:00.1531307Z ) 2025-05-07T20:33:00.1531617Z self = 2025-05-07T20:33:00.1532108Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.1532386Z 2025-05-07T20:33:00.1532472Z @given( 2025-05-07T20:33:00.1532691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1533003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1533306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1533676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1533993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1534282Z ) 2025-05-07T20:33:00.1534629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1535066Z def test_silu_mul_quant( 2025-05-07T20:33:00.1535306Z self, 2025-05-07T20:33:00.1535496Z T: int, 2025-05-07T20:33:00.1535682Z D: int, 2025-05-07T20:33:00.1535903Z scale_ub: Optional[float], 2025-05-07T20:33:00.1536176Z contiguous: bool, 2025-05-07T20:33:00.1536406Z compiled: bool, 2025-05-07T20:33:00.1536627Z ) -> None: 2025-05-07T20:33:00.1536847Z torch.manual_seed(2025) 2025-05-07T20:33:00.1537078Z 2025-05-07T20:33:00.1537344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1537688Z 2025-05-07T20:33:00.1537877Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1538172Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1540600Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
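Note on the CompilationError above: fp8e4nv is Triton's float8 e4m3 type, which requires compute capability 8.9 (Ada) or newer. The A10G in a linux.g5.4xlarge runner is sm_86, where Triton only exposes fp8e4b15 and fp8e5, so every compiled example of this test fails with the same error. A minimal guard (a sketch; the helper and decorator names are illustrative, not part of the test file):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) needs compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant (or its TestCase), this turns the
    # repeated CompilationError on sm_86 runners into a skip:
    skip_unless_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "Triton fp8e4nv requires sm_89+ (Ada/Hopper)"
    )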
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1542464Z 2025-05-07T20:33:00.1542583Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:00.1542798Z 2025-05-07T20:33:00.1542900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1543317Z self=, 2025-05-07T20:33:00.1543710Z T=4096, 2025-05-07T20:33:00.1543898Z D=7168, 2025-05-07T20:33:00.1544093Z scale_ub=1200.0, 2025-05-07T20:33:00.1544316Z contiguous=True, 2025-05-07T20:33:00.1544679Z compiled=True, 2025-05-07T20:33:00.1544887Z ) 2025-05-07T20:33:00.1545200Z self = 2025-05-07T20:33:00.1545687Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.1545967Z 2025-05-07T20:33:00.1546047Z @given( 2025-05-07T20:33:00.1546271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1546576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1546877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1547200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1547569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1547846Z ) 2025-05-07T20:33:00.1548188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1548618Z def test_silu_mul_quant( 2025-05-07T20:33:00.1548857Z self, 2025-05-07T20:33:00.1549046Z T: int, 2025-05-07T20:33:00.1549239Z D: int, 2025-05-07T20:33:00.1549521Z scale_ub: Optional[float], 2025-05-07T20:33:00.1549790Z contiguous: bool, 2025-05-07T20:33:00.1550025Z compiled: bool, 2025-05-07T20:33:00.1550237Z ) -> None: 2025-05-07T20:33:00.1550448Z torch.manual_seed(2025) 2025-05-07T20:33:00.1550691Z 2025-05-07T20:33:00.1550951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1551299Z 2025-05-07T20:33:00.1551487Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1551769Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1553746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1555703Z 2025-05-07T20:33:00.1555819Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:00.1556032Z 2025-05-07T20:33:00.1556136Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1556546Z self=, 2025-05-07T20:33:00.1556955Z T=16384, 2025-05-07T20:33:00.1557143Z D=7168, 2025-05-07T20:33:00.1557328Z scale_ub=None, 2025-05-07T20:33:00.1557537Z contiguous=False, 2025-05-07T20:33:00.1557759Z compiled=False, 2025-05-07T20:33:00.1557958Z ) 2025-05-07T20:33:00.1558262Z self = 2025-05-07T20:33:00.1559100Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.1559409Z 2025-05-07T20:33:00.1559522Z @given( 2025-05-07T20:33:00.1568159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1568516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1568826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1569154Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1569471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1569754Z ) 2025-05-07T20:33:00.1570111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1570555Z def test_silu_mul_quant( 2025-05-07T20:33:00.1570799Z self, 2025-05-07T20:33:00.1570993Z T: int, 2025-05-07T20:33:00.1571184Z D: int, 2025-05-07T20:33:00.1571401Z scale_ub: Optional[float], 2025-05-07T20:33:00.1571678Z contiguous: bool, 2025-05-07T20:33:00.1571919Z compiled: bool, 2025-05-07T20:33:00.1572141Z ) -> None: 2025-05-07T20:33:00.1572483Z torch.manual_seed(2025) 2025-05-07T20:33:00.1572732Z 2025-05-07T20:33:00.1573005Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1575087Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1576947Z 2025-05-07T20:33:00.1577069Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.2811463Z 2025-05-07T20:33:00.2811804Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2812472Z self=, 2025-05-07T20:33:00.2813387Z T=2048, 2025-05-07T20:33:00.2813630Z D=7168, 2025-05-07T20:33:00.2813820Z scale_ub=1200.0, 2025-05-07T20:33:00.2814044Z contiguous=True, 2025-05-07T20:33:00.2814270Z compiled=True, 2025-05-07T20:33:00.2814475Z ) 2025-05-07T20:33:00.2814794Z self = 2025-05-07T20:33:00.2815288Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.2815556Z 2025-05-07T20:33:00.2815638Z @given( 2025-05-07T20:33:00.2815868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2816183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2816487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2816920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2817250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2817539Z ) 2025-05-07T20:33:00.2817906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2818370Z def test_silu_mul_quant( 2025-05-07T20:33:00.2818620Z self, 2025-05-07T20:33:00.2818808Z T: int, 2025-05-07T20:33:00.2819014Z D: int, 2025-05-07T20:33:00.2819247Z scale_ub: Optional[float], 2025-05-07T20:33:00.2819520Z contiguous: bool, 2025-05-07T20:33:00.2819763Z compiled: bool, 2025-05-07T20:33:00.2819993Z ) -> None: 2025-05-07T20:33:00.2820205Z torch.manual_seed(2025) 2025-05-07T20:33:00.2820451Z 2025-05-07T20:33:00.2820724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2821063Z 2025-05-07T20:33:00.2821257Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2821555Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2823569Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.2825507Z 2025-05-07T20:33:00.2825630Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:00.2825843Z 2025-05-07T20:33:00.2825945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2826364Z self=, 2025-05-07T20:33:00.2826780Z T=2048, 2025-05-07T20:33:00.2826975Z D=7168, 2025-05-07T20:33:00.2827160Z scale_ub=None, 2025-05-07T20:33:00.2827373Z contiguous=True, 2025-05-07T20:33:00.2827695Z compiled=False, 2025-05-07T20:33:00.2828052Z ) 2025-05-07T20:33:00.2828373Z self = 2025-05-07T20:33:00.2828862Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.2829125Z 2025-05-07T20:33:00.2829204Z @given( 2025-05-07T20:33:00.2829431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2829746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2830045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2830375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2830698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2830981Z ) 2025-05-07T20:33:00.2831325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2831770Z def test_silu_mul_quant( 2025-05-07T20:33:00.2832020Z self, 2025-05-07T20:33:00.2832210Z T: int, 2025-05-07T20:33:00.2832406Z D: int, 2025-05-07T20:33:00.2832634Z scale_ub: Optional[float], 2025-05-07T20:33:00.2832949Z contiguous: bool, 2025-05-07T20:33:00.2833196Z compiled: bool, 2025-05-07T20:33:00.2833427Z ) -> None: 2025-05-07T20:33:00.2833634Z torch.manual_seed(2025) 2025-05-07T20:33:00.2833882Z 2025-05-07T20:33:00.2834150Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2834493Z 2025-05-07T20:33:00.2834689Z > x_sign = torch.sign(x) 2025-05-07T20:33:00.2836603Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
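The OOM sizes track the bf16 input exactly: x has shape [T, 2*D] in bfloat16, so the 56.00 MiB failures correspond to T=2048, D=7168 (2048 * 14336 * 2 bytes) and the 320.00 MiB one to T=16384, D=5120. The PYTORCH_CUDA_ALLOC_CONF hint printed in the message only helps if it is in place before the process first touches CUDA; a sketch of wiring it in at interpreter startup (the placement, not the variable itself, is the assumption here):

    import os

    # Must be set before the first CUDA allocation; once the caching
    # allocator is initialized, changing the variable has no effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the allocator picks it up

In CI it is usually cleaner to export the variable in the job definition than to rely on import order.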
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.2838571Z 2025-05-07T20:33:00.2838692Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:00.2838901Z 2025-05-07T20:33:00.2839011Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2839419Z self=, 2025-05-07T20:33:00.2839824Z T=1, 2025-05-07T20:33:00.2840011Z D=7168, 2025-05-07T20:33:00.2840458Z scale_ub=1200.0, 2025-05-07T20:33:00.2840682Z contiguous=True, 2025-05-07T20:33:00.2840907Z compiled=False, 2025-05-07T20:33:00.2841107Z ) 2025-05-07T20:33:00.2841427Z self = 2025-05-07T20:33:00.2841909Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.2842181Z 2025-05-07T20:33:00.2842264Z @given( 2025-05-07T20:33:00.2842485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2842811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2843121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2843441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2843771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2844056Z ) 2025-05-07T20:33:00.2844404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2844842Z def test_silu_mul_quant( 2025-05-07T20:33:00.2845089Z self, 2025-05-07T20:33:00.2845280Z T: int, 2025-05-07T20:33:00.2845487Z D: int, 2025-05-07T20:33:00.2845713Z scale_ub: Optional[float], 2025-05-07T20:33:00.2845983Z contiguous: bool, 2025-05-07T20:33:00.2846223Z compiled: bool, 2025-05-07T20:33:00.2846455Z ) -> None: 2025-05-07T20:33:00.2846668Z torch.manual_seed(2025) 2025-05-07T20:33:00.2846910Z 2025-05-07T20:33:00.2847305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2847655Z 2025-05-07T20:33:00.2847847Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2848140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2848453Z x = x_sign * x_clamp 2025-05-07T20:33:00.2848691Z x0 = x[:, :D] 2025-05-07T20:33:00.2848910Z x1 = x[:, D:] 2025-05-07T20:33:00.2849124Z 2025-05-07T20:33:00.2849304Z if contiguous: 2025-05-07T20:33:00.2849547Z x0 = x0.contiguous() 2025-05-07T20:33:00.2849813Z x1 = x1.contiguous() 2025-05-07T20:33:00.2850054Z 2025-05-07T20:33:00.2850248Z if scale_ub is not None: 2025-05-07T20:33:00.2850530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2850863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2851193Z ) 2025-05-07T20:33:00.2851393Z else: 2025-05-07T20:33:00.2851614Z scale_ub_tensor = None 2025-05-07T20:33:00.2851863Z 2025-05-07T20:33:00.2852112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2852496Z op = silu_mul_quant 2025-05-07T20:33:00.2852755Z if compiled: 2025-05-07T20:33:00.2853005Z op = torch.compile(op) 2025-05-07T20:33:00.2853312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2853592Z 2025-05-07T20:33:00.2853791Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2853955Z 2025-05-07T20:33:00.2854064Z moe/activation_test.py:117: 2025-05-07T20:33:00.2854356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2854695Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2854976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2855662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2856416Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2856967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2857652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2858305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2858837Z kernel = self.compile( 2025-05-07T20:33:00.2859389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2860045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2860442Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2860697Z 2025-05-07T20:33:00.2860905Z self = 2025-05-07T20:33:00.2861981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2863341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7504b80>} 2025-05-07T20:33:00.2864666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2865677Z context = 2025-05-07T20:33:00.2865957Z 2025-05-07T20:33:00.2866123Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2866644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2867193Z module_map=module_map) 2025-05-07T20:33:00.2867626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2867982Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2868240Z E ^ 2025-05-07T20:33:00.2868703Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2869159Z 2025-05-07T20:33:00.2869581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.2870095Z 2025-05-07T20:33:00.2870197Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2870611Z self=, 2025-05-07T20:33:00.2871001Z T=128, 2025-05-07T20:33:00.2871187Z D=5120, 2025-05-07T20:33:00.2871381Z scale_ub=None, 2025-05-07T20:33:00.2871590Z contiguous=True, 2025-05-07T20:33:00.2871810Z compiled=False, 2025-05-07T20:33:00.2872013Z ) 2025-05-07T20:33:00.2872336Z self = 2025-05-07T20:33:00.2872942Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.2873208Z 2025-05-07T20:33:00.2873288Z @given( 2025-05-07T20:33:00.2873516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2873822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2874128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2874474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2874800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2875077Z ) 2025-05-07T20:33:00.2875422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2875859Z def test_silu_mul_quant( 2025-05-07T20:33:00.2876150Z self, 2025-05-07T20:33:00.2876350Z T: int, 2025-05-07T20:33:00.2876550Z D: int, 2025-05-07T20:33:00.2876766Z scale_ub: Optional[float], 2025-05-07T20:33:00.2877039Z contiguous: bool, 2025-05-07T20:33:00.2877283Z compiled: bool, 2025-05-07T20:33:00.2877503Z ) -> None: 2025-05-07T20:33:00.2877717Z torch.manual_seed(2025) 2025-05-07T20:33:00.2877958Z 2025-05-07T20:33:00.2878231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2878583Z 2025-05-07T20:33:00.2878787Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2879084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2879390Z x = x_sign * x_clamp 2025-05-07T20:33:00.2879631Z x0 = x[:, :D] 2025-05-07T20:33:00.2879846Z x1 = x[:, D:] 2025-05-07T20:33:00.2880045Z 2025-05-07T20:33:00.2880227Z if contiguous: 2025-05-07T20:33:00.2880463Z x0 = x0.contiguous() 2025-05-07T20:33:00.2880713Z x1 = x1.contiguous() 2025-05-07T20:33:00.2880949Z 2025-05-07T20:33:00.2881138Z if scale_ub is not None: 2025-05-07T20:33:00.2881408Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2881741Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2882045Z ) 2025-05-07T20:33:00.2882228Z else: 2025-05-07T20:33:00.2882438Z scale_ub_tensor = None 2025-05-07T20:33:00.2882689Z 2025-05-07T20:33:00.2882912Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2883222Z op = silu_mul_quant 2025-05-07T20:33:00.2883471Z if compiled: 2025-05-07T20:33:00.2883718Z op = torch.compile(op) 2025-05-07T20:33:00.2884010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2884284Z 2025-05-07T20:33:00.2884478Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2884641Z 2025-05-07T20:33:00.2884740Z moe/activation_test.py:117: 2025-05-07T20:33:00.2885027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2885441Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2885717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2886423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2887103Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2887639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2888314Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2888989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2889532Z kernel = self.compile( 2025-05-07T20:33:00.2890085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2890735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2891131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2891408Z 2025-05-07T20:33:00.2891616Z self = 2025-05-07T20:33:00.2892674Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2894026Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7505a80>} 2025-05-07T20:33:00.2895346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2896404Z context = 2025-05-07T20:33:00.2896692Z 2025-05-07T20:33:00.2896870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2897389Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2897873Z module_map=module_map) 2025-05-07T20:33:00.2898237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2898581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2898841Z E ^ 2025-05-07T20:33:00.2899307Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2899753Z 2025-05-07T20:33:00.2900184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4064340Z 2025-05-07T20:33:00.4064978Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4066170Z self=, 2025-05-07T20:33:00.4067301Z T=128, 2025-05-07T20:33:00.4067979Z D=7168, 2025-05-07T20:33:00.4068472Z scale_ub=None, 2025-05-07T20:33:00.4069050Z contiguous=True, 2025-05-07T20:33:00.4069645Z compiled=False, 2025-05-07T20:33:00.4070126Z ) 2025-05-07T20:33:00.4070768Z self = 2025-05-07T20:33:00.4071649Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4071923Z 2025-05-07T20:33:00.4072001Z @given( 2025-05-07T20:33:00.4072234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4072548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4072850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4073190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4073513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4073796Z ) 2025-05-07T20:33:00.4074466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4074902Z def test_silu_mul_quant( 2025-05-07T20:33:00.4075145Z self, 2025-05-07T20:33:00.4075342Z T: int, 2025-05-07T20:33:00.4075541Z D: int, 2025-05-07T20:33:00.4075754Z scale_ub: Optional[float], 2025-05-07T20:33:00.4076023Z contiguous: bool, 2025-05-07T20:33:00.4076263Z compiled: bool, 2025-05-07T20:33:00.4076484Z ) -> None: 2025-05-07T20:33:00.4076704Z torch.manual_seed(2025) 2025-05-07T20:33:00.4076946Z 2025-05-07T20:33:00.4077213Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4077560Z 2025-05-07T20:33:00.4077754Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4078047Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4078355Z x = x_sign * x_clamp 2025-05-07T20:33:00.4078596Z x0 = x[:, :D] 2025-05-07T20:33:00.4078811Z x1 = x[:, D:] 2025-05-07T20:33:00.4079109Z 2025-05-07T20:33:00.4079294Z if contiguous: 2025-05-07T20:33:00.4079519Z x0 = x0.contiguous() 2025-05-07T20:33:00.4079775Z x1 = x1.contiguous() 2025-05-07T20:33:00.4080018Z 2025-05-07T20:33:00.4080202Z if scale_ub is not None: 2025-05-07T20:33:00.4080473Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4080804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4081112Z ) 2025-05-07T20:33:00.4081303Z else: 2025-05-07T20:33:00.4081516Z scale_ub_tensor = None 2025-05-07T20:33:00.4081763Z 2025-05-07T20:33:00.4081989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4082297Z op = silu_mul_quant 2025-05-07T20:33:00.4082631Z if compiled: 2025-05-07T20:33:00.4082870Z op = torch.compile(op) 2025-05-07T20:33:00.4083166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4083444Z 2025-05-07T20:33:00.4083630Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4083800Z 2025-05-07T20:33:00.4083897Z moe/activation_test.py:117: 2025-05-07T20:33:00.4084189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4084526Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4084803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4085491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.4086173Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4086707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4087397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4088079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4088605Z kernel = self.compile( 2025-05-07T20:33:00.4089137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4089787Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4090178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4090401Z 2025-05-07T20:33:00.4090606Z self = 2025-05-07T20:33:00.4091679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4093130Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7506980>} 2025-05-07T20:33:00.4094451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4095462Z context = 2025-05-07T20:33:00.4095742Z 2025-05-07T20:33:00.4095905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4096426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4096905Z module_map=module_map) 2025-05-07T20:33:00.4097278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4097621Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4097886Z E ^ 2025-05-07T20:33:00.4098351Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4098797Z 2025-05-07T20:33:00.4099273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4099782Z 2025-05-07T20:33:00.4099886Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4100298Z self=, 2025-05-07T20:33:00.4100704Z T=2048, 2025-05-07T20:33:00.4100889Z D=7168, 2025-05-07T20:33:00.4101086Z scale_ub=1200.0, 2025-05-07T20:33:00.4101315Z contiguous=True, 2025-05-07T20:33:00.4101553Z compiled=False, 2025-05-07T20:33:00.4101791Z ) 2025-05-07T20:33:00.4102118Z self = 2025-05-07T20:33:00.4102600Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.4102920Z 2025-05-07T20:33:00.4103000Z @given( 2025-05-07T20:33:00.4103229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4103537Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4103843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4104171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4104500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4104776Z ) 2025-05-07T20:33:00.4105130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4105565Z def test_silu_mul_quant( 2025-05-07T20:33:00.4105800Z self, 2025-05-07T20:33:00.4105997Z T: int, 2025-05-07T20:33:00.4106197Z D: int, 2025-05-07T20:33:00.4106409Z scale_ub: Optional[float], 2025-05-07T20:33:00.4106678Z contiguous: bool, 2025-05-07T20:33:00.4106925Z compiled: bool, 2025-05-07T20:33:00.4107150Z ) -> None: 2025-05-07T20:33:00.4107365Z torch.manual_seed(2025) 2025-05-07T20:33:00.4107677Z 2025-05-07T20:33:00.4107947Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4109974Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
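The OOM failures cascade: the "allocated by PyTorch" figure sits at roughly 21.5-21.7 GiB across examples, so allocations from earlier examples are evidently still live when Hypothesis draws the next one, and even 40-56 MiB requests start failing. A cleanup helper called at the top of the test body is one way to decouple examples; calling it per example rather than in setUp matters because @given runs all examples inside a single unittest method call, so setUp fires only once. A sketch (the helper name is illustrative):

    import gc
    import torch

    def _release_cuda_memory() -> None:
        # Drop dead Python references from the previous Hypothesis example,
        # then hand cached blocks back so the next torch.randn can succeed.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # first statement of test_silu_mul_quant's body: _release_cuda_memory()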
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4111941Z 2025-05-07T20:33:00.4112061Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.4112274Z 2025-05-07T20:33:00.4112375Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4112785Z self=, 2025-05-07T20:33:00.4113176Z T=1, 2025-05-07T20:33:00.4113479Z D=5120, 2025-05-07T20:33:00.4113672Z scale_ub=1200.0, 2025-05-07T20:33:00.4113892Z contiguous=True, 2025-05-07T20:33:00.4114115Z compiled=False, 2025-05-07T20:33:00.4114323Z ) 2025-05-07T20:33:00.4114635Z self = 2025-05-07T20:33:00.4115137Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.4115403Z 2025-05-07T20:33:00.4115483Z @given( 2025-05-07T20:33:00.4115765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4116286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4116689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4117098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4126166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4126479Z ) 2025-05-07T20:33:00.4126836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4127299Z def test_silu_mul_quant( 2025-05-07T20:33:00.4127628Z self, 2025-05-07T20:33:00.4127836Z T: int, 2025-05-07T20:33:00.4128035Z D: int, 2025-05-07T20:33:00.4128265Z scale_ub: Optional[float], 2025-05-07T20:33:00.4128545Z contiguous: bool, 2025-05-07T20:33:00.4128788Z compiled: bool, 2025-05-07T20:33:00.4129022Z ) -> None: 2025-05-07T20:33:00.4129246Z torch.manual_seed(2025) 2025-05-07T20:33:00.4129489Z 2025-05-07T20:33:00.4129766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4130112Z 2025-05-07T20:33:00.4130308Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4130594Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4130902Z x = x_sign * x_clamp 2025-05-07T20:33:00.4131191Z x0 = x[:, :D] 2025-05-07T20:33:00.4131434Z x1 = x[:, D:] 2025-05-07T20:33:00.4131675Z 2025-05-07T20:33:00.4131863Z if contiguous: 2025-05-07T20:33:00.4132101Z x0 = x0.contiguous() 2025-05-07T20:33:00.4132366Z x1 = x1.contiguous() 2025-05-07T20:33:00.4132613Z 2025-05-07T20:33:00.4132803Z if scale_ub is not None: 2025-05-07T20:33:00.4133081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4133419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4133729Z ) 2025-05-07T20:33:00.4133930Z else: 2025-05-07T20:33:00.4134145Z scale_ub_tensor = None 2025-05-07T20:33:00.4134401Z 2025-05-07T20:33:00.4134635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4134951Z op = silu_mul_quant 2025-05-07T20:33:00.4135205Z if compiled: 2025-05-07T20:33:00.4135452Z op = torch.compile(op) 2025-05-07T20:33:00.4135761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4136043Z 2025-05-07T20:33:00.4136238Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4136409Z 2025-05-07T20:33:00.4136516Z moe/activation_test.py:117: 2025-05-07T20:33:00.4136817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4137145Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4137431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4138120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.4138805Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4139341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4140023Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4140944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4141473Z kernel = self.compile( 2025-05-07T20:33:00.4142169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4142834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4143250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4143485Z 2025-05-07T20:33:00.4143690Z self = 2025-05-07T20:33:00.4144763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4146122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7507e20>} 2025-05-07T20:33:00.4147511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4148595Z context = 2025-05-07T20:33:00.4148883Z 2025-05-07T20:33:00.4149050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4149573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4150049Z module_map=module_map) 2025-05-07T20:33:00.4150412Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4150770Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4151030Z E ^ 2025-05-07T20:33:00.4151490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4152023Z 2025-05-07T20:33:00.4152459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4960613Z 2025-05-07T20:33:00.4961060Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4961668Z self=, 2025-05-07T20:33:00.4962203Z T=2048, 2025-05-07T20:33:00.4962442Z D=5120, 2025-05-07T20:33:00.4962681Z scale_ub=None, 2025-05-07T20:33:00.4962936Z contiguous=True, 2025-05-07T20:33:00.4963216Z compiled=False, 2025-05-07T20:33:00.4963473Z ) 2025-05-07T20:33:00.4963808Z self = 2025-05-07T20:33:00.4964296Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4964576Z 2025-05-07T20:33:00.4964654Z @given( 2025-05-07T20:33:00.4964895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4965197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4965505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4965828Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4966143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4966420Z ) 2025-05-07T20:33:00.4966763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4967188Z def test_silu_mul_quant( 2025-05-07T20:33:00.4967422Z self, 2025-05-07T20:33:00.4967609Z T: int, 2025-05-07T20:33:00.4967795Z D: int, 2025-05-07T20:33:00.4968002Z scale_ub: Optional[float], 2025-05-07T20:33:00.4968264Z contiguous: bool, 2025-05-07T20:33:00.4968496Z compiled: bool, 2025-05-07T20:33:00.4968707Z ) -> None: 2025-05-07T20:33:00.4968912Z torch.manual_seed(2025) 2025-05-07T20:33:00.4969151Z 2025-05-07T20:33:00.4969413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4969749Z 2025-05-07T20:33:00.4969934Z > x_sign = torch.sign(x) 2025-05-07T20:33:00.4972147Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4974020Z 2025-05-07T20:33:00.4974141Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:00.4974346Z 2025-05-07T20:33:00.4974442Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4974850Z self=, 2025-05-07T20:33:00.4975254Z T=16384, 2025-05-07T20:33:00.4975439Z D=5120, 2025-05-07T20:33:00.4975622Z scale_ub=None, 2025-05-07T20:33:00.4975830Z contiguous=True, 2025-05-07T20:33:00.4976106Z compiled=False, 2025-05-07T20:33:00.4976301Z ) 2025-05-07T20:33:00.4976614Z self = 2025-05-07T20:33:00.4977101Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4977370Z 2025-05-07T20:33:00.4977447Z @given( 2025-05-07T20:33:00.4977677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4977996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4978294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4978646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4978966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4979312Z ) 2025-05-07T20:33:00.4979646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4980097Z def test_silu_mul_quant( 2025-05-07T20:33:00.4980335Z self, 2025-05-07T20:33:00.4980528Z T: int, 2025-05-07T20:33:00.4980713Z D: int, 2025-05-07T20:33:00.4980932Z scale_ub: Optional[float], 2025-05-07T20:33:00.4981210Z contiguous: bool, 2025-05-07T20:33:00.4981436Z compiled: bool, 2025-05-07T20:33:00.4981691Z ) -> None: 2025-05-07T20:33:00.4981921Z torch.manual_seed(2025) 2025-05-07T20:33:00.4982157Z 2025-05-07T20:33:00.4982417Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4984437Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4986397Z 2025-05-07T20:33:00.4986524Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.4986730Z 2025-05-07T20:33:00.4986834Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4987237Z self=, 2025-05-07T20:33:00.4987695Z T=4096, 2025-05-07T20:33:00.4987877Z D=5120, 2025-05-07T20:33:00.4988049Z scale_ub=None, 2025-05-07T20:33:00.4988255Z contiguous=True, 2025-05-07T20:33:00.4988470Z compiled=False, 2025-05-07T20:33:00.4988658Z ) 2025-05-07T20:33:00.4988970Z self = 2025-05-07T20:33:00.4989453Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4989713Z 2025-05-07T20:33:00.4989790Z @given( 2025-05-07T20:33:00.4990093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4990406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4990699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4991011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4991330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4991646Z ) 2025-05-07T20:33:00.4991989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4992418Z def test_silu_mul_quant( 2025-05-07T20:33:00.4992651Z self, 2025-05-07T20:33:00.4992832Z T: int, 2025-05-07T20:33:00.4993022Z D: int, 2025-05-07T20:33:00.4993233Z scale_ub: Optional[float], 2025-05-07T20:33:00.4993498Z contiguous: bool, 2025-05-07T20:33:00.4993742Z compiled: bool, 2025-05-07T20:33:00.4993958Z ) -> None: 2025-05-07T20:33:00.4994162Z torch.manual_seed(2025) 2025-05-07T20:33:00.4994404Z 2025-05-07T20:33:00.4994678Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4996756Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4998658Z 2025-05-07T20:33:00.4998782Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.4998985Z 2025-05-07T20:33:00.4999127Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4999544Z self=, 2025-05-07T20:33:00.4999957Z T=2048, 2025-05-07T20:33:00.5000139Z D=5120, 2025-05-07T20:33:00.5000324Z scale_ub=None, 2025-05-07T20:33:00.5000542Z contiguous=False, 2025-05-07T20:33:00.5000771Z compiled=False, 2025-05-07T20:33:00.5000970Z ) 2025-05-07T20:33:00.5001279Z self = 2025-05-07T20:33:00.5001770Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.5002044Z 2025-05-07T20:33:00.5002121Z @given( 2025-05-07T20:33:00.5002347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5002653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5002954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5003272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5003594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5003872Z ) 2025-05-07T20:33:00.5004215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5004644Z def test_silu_mul_quant( 2025-05-07T20:33:00.5004889Z self, 2025-05-07T20:33:00.5005070Z T: int, 2025-05-07T20:33:00.5005267Z D: int, 2025-05-07T20:33:00.5005476Z scale_ub: Optional[float], 2025-05-07T20:33:00.5005728Z contiguous: bool, 2025-05-07T20:33:00.5005968Z compiled: bool, 2025-05-07T20:33:00.5006182Z ) -> None: 2025-05-07T20:33:00.5006382Z torch.manual_seed(2025) 2025-05-07T20:33:00.5006615Z 2025-05-07T20:33:00.5006876Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5008962Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
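For orientation, the op under test fuses silu(x0) * x1 with fp8 quantization and returns (y_fp8, y_scale). The following is a plain-PyTorch reconstruction inferred from the test's call signature, not FBGEMM's Triton kernel; the row-wise scaling, the eps clamp, and the e4m3fn dtype are all assumptions:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32 for accuracy, then quantize row-wise to fp8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # scale_ub caps the per-row maximum before the scale is derived.
            row_max = torch.minimum(row_max, scale_ub.float())
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)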
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5010774Z 2025-05-07T20:33:00.5010893Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5011097Z 2025-05-07T20:33:00.5011192Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5011600Z self=, 2025-05-07T20:33:00.5012047Z T=4096, 2025-05-07T20:33:00.5012225Z D=7168, 2025-05-07T20:33:00.5012405Z scale_ub=None, 2025-05-07T20:33:00.5012612Z contiguous=True, 2025-05-07T20:33:00.5012820Z compiled=True, 2025-05-07T20:33:00.5013013Z ) 2025-05-07T20:33:00.5013320Z self = 2025-05-07T20:33:00.5013807Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.5014064Z 2025-05-07T20:33:00.5014143Z @given( 2025-05-07T20:33:00.5014409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5014709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5014994Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5015310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5015632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5015902Z ) 2025-05-07T20:33:00.5016238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5016670Z def test_silu_mul_quant( 2025-05-07T20:33:00.5016907Z self, 2025-05-07T20:33:00.5017089Z T: int, 2025-05-07T20:33:00.5017290Z D: int, 2025-05-07T20:33:00.5017504Z scale_ub: Optional[float], 2025-05-07T20:33:00.5017811Z contiguous: bool, 2025-05-07T20:33:00.5018052Z compiled: bool, 2025-05-07T20:33:00.5018273Z ) -> None: 2025-05-07T20:33:00.5018482Z torch.manual_seed(2025) 2025-05-07T20:33:00.5018716Z 2025-05-07T20:33:00.5018985Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5021009Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5022931Z 2025-05-07T20:33:00.5023043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5023256Z 2025-05-07T20:33:00.5023354Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5023767Z self=, 2025-05-07T20:33:00.5024166Z T=2048, 2025-05-07T20:33:00.5024343Z D=5120, 2025-05-07T20:33:00.5024532Z scale_ub=1200.0, 2025-05-07T20:33:00.5024754Z contiguous=False, 2025-05-07T20:33:00.5024969Z compiled=False, 2025-05-07T20:33:00.5576229Z ) 2025-05-07T20:33:00.5577105Z self = 2025-05-07T20:33:00.5578158Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.5578691Z 2025-05-07T20:33:00.5578848Z @given( 2025-05-07T20:33:00.5579277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5579886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5580482Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5581117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5581564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5581841Z ) 2025-05-07T20:33:00.5582348Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5582782Z def test_silu_mul_quant( 2025-05-07T20:33:00.5583025Z self, 2025-05-07T20:33:00.5583206Z T: int, 2025-05-07T20:33:00.5583395Z D: int, 2025-05-07T20:33:00.5583611Z scale_ub: Optional[float], 2025-05-07T20:33:00.5583876Z contiguous: bool, 2025-05-07T20:33:00.5584112Z compiled: bool, 2025-05-07T20:33:00.5584335Z ) -> None: 2025-05-07T20:33:00.5584541Z torch.manual_seed(2025) 2025-05-07T20:33:00.5584774Z 2025-05-07T20:33:00.5585036Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5587074Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5589076Z 2025-05-07T20:33:00.5589197Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5589400Z 2025-05-07T20:33:00.5589498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5589905Z self=, 2025-05-07T20:33:00.5590308Z T=4096, 2025-05-07T20:33:00.5590494Z D=7168, 2025-05-07T20:33:00.5590671Z scale_ub=1200.0, 2025-05-07T20:33:00.5590887Z contiguous=True, 2025-05-07T20:33:00.5591103Z compiled=False, 2025-05-07T20:33:00.5591370Z ) 2025-05-07T20:33:00.5591693Z self = 2025-05-07T20:33:00.5592181Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.5592452Z 2025-05-07T20:33:00.5592527Z @given( 2025-05-07T20:33:00.5592753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5593059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5593352Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5593676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5594004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5594295Z ) 2025-05-07T20:33:00.5594639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5595084Z def test_silu_mul_quant( 2025-05-07T20:33:00.5595324Z self, 2025-05-07T20:33:00.5595508Z T: int, 2025-05-07T20:33:00.5595706Z D: int, 2025-05-07T20:33:00.5595922Z scale_ub: Optional[float], 2025-05-07T20:33:00.5596184Z contiguous: bool, 2025-05-07T20:33:00.5596421Z compiled: bool, 2025-05-07T20:33:00.5596645Z ) -> None: 2025-05-07T20:33:00.5596855Z torch.manual_seed(2025) 2025-05-07T20:33:00.5597098Z 2025-05-07T20:33:00.5597364Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5599395Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5601259Z 2025-05-07T20:33:00.5601379Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5601586Z 2025-05-07T20:33:00.5601765Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5602175Z self=, 2025-05-07T20:33:00.5602577Z T=16384, 2025-05-07T20:33:00.5602758Z D=7168, 2025-05-07T20:33:00.5602942Z scale_ub=None, 2025-05-07T20:33:00.5603153Z contiguous=False, 2025-05-07T20:33:00.5603369Z compiled=True, 2025-05-07T20:33:00.5603572Z ) 2025-05-07T20:33:00.5603887Z self = 2025-05-07T20:33:00.5604370Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.5604649Z 2025-05-07T20:33:00.5604721Z @given( 2025-05-07T20:33:00.5604939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5605252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5605542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5605862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5606185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5606505Z ) 2025-05-07T20:33:00.5606862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5607301Z def test_silu_mul_quant( 2025-05-07T20:33:00.5607538Z self, 2025-05-07T20:33:00.5607733Z T: int, 2025-05-07T20:33:00.5607956Z D: int, 2025-05-07T20:33:00.5608173Z scale_ub: Optional[float], 2025-05-07T20:33:00.5608438Z contiguous: bool, 2025-05-07T20:33:00.5608668Z compiled: bool, 2025-05-07T20:33:00.5608881Z ) -> None: 2025-05-07T20:33:00.5609089Z torch.manual_seed(2025) 2025-05-07T20:33:00.5609330Z 2025-05-07T20:33:00.5609597Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5611619Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
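Since the sampled sizes go up to T=16384 with D=7168 (a 448 MiB bf16 input before any intermediates), another way to keep the property test alive on a ~22 GiB card is to reject draws that cannot fit. The budget below is illustrative, not a measured limit:

    from hypothesis import assume

    MAX_INPUT_BYTES = 512 * 1024 * 1024  # illustrative per-example budget

    def fits_on_runner(T: int, D: int) -> bool:
        # x is [T, 2*D] bfloat16, i.e. 2 bytes per element.
        return T * 2 * D * 2 <= MAX_INPUT_BYTES

    # inside the test body, before torch.randn:
    #     assume(fits_on_runner(T, D))

hypothesis.assume makes Hypothesis discard an example that fails the predicate instead of recording it as a failure, so coverage of the remaining parameter grid is preserved.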
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5613565Z 2025-05-07T20:33:00.5613691Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5613894Z 2025-05-07T20:33:00.5613994Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5614404Z self=, 2025-05-07T20:33:00.5614803Z T=4096, 2025-05-07T20:33:00.5614983Z D=7168, 2025-05-07T20:33:00.5615171Z scale_ub=None, 2025-05-07T20:33:00.5615386Z contiguous=True, 2025-05-07T20:33:00.5615601Z compiled=False, 2025-05-07T20:33:00.5615804Z ) 2025-05-07T20:33:00.5616124Z self = 2025-05-07T20:33:00.5616611Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.5616874Z 2025-05-07T20:33:00.5616948Z @given( 2025-05-07T20:33:00.5617173Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5617472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5617763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5618085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5618406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5618681Z ) 2025-05-07T20:33:00.5619018Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5619445Z def test_silu_mul_quant( 2025-05-07T20:33:00.5619684Z self, 2025-05-07T20:33:00.5619875Z T: int, 2025-05-07T20:33:00.5620069Z D: int, 2025-05-07T20:33:00.5620281Z scale_ub: Optional[float], 2025-05-07T20:33:00.5620621Z contiguous: bool, 2025-05-07T20:33:00.5620860Z compiled: bool, 2025-05-07T20:33:00.5621079Z ) -> None: 2025-05-07T20:33:00.5621281Z torch.manual_seed(2025) 2025-05-07T20:33:00.5621522Z 2025-05-07T20:33:00.5621808Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5623838Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5625686Z 2025-05-07T20:33:00.5625806Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5626016Z 2025-05-07T20:33:00.5626156Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5626556Z self=, 2025-05-07T20:33:00.5626955Z T=16384, 2025-05-07T20:33:00.5627132Z D=7168, 2025-05-07T20:33:00.5627312Z scale_ub=None, 2025-05-07T20:33:00.5627564Z contiguous=True, 2025-05-07T20:33:00.5627780Z compiled=False, 2025-05-07T20:33:00.5627982Z ) 2025-05-07T20:33:00.5628289Z self = 2025-05-07T20:33:00.5628766Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.5629046Z 2025-05-07T20:33:00.5629122Z @given( 2025-05-07T20:33:00.5629339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5629694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5630281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5630702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5631061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5638734Z ) 2025-05-07T20:33:00.5639107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5639565Z def test_silu_mul_quant( 2025-05-07T20:33:00.5639803Z self, 2025-05-07T20:33:00.5639996Z T: int, 2025-05-07T20:33:00.5640375Z D: int, 2025-05-07T20:33:00.5640581Z scale_ub: Optional[float], 2025-05-07T20:33:00.5640847Z contiguous: bool, 2025-05-07T20:33:00.5641089Z compiled: bool, 2025-05-07T20:33:00.5641302Z ) -> None: 2025-05-07T20:33:00.5641508Z torch.manual_seed(2025) 2025-05-07T20:33:00.5641741Z 2025-05-07T20:33:00.5642009Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5644049Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5645917Z 2025-05-07T20:33:00.5646032Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5646249Z 2025-05-07T20:33:00.5646348Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5646761Z self=, 2025-05-07T20:33:00.5647156Z T=16384, 2025-05-07T20:33:00.5647345Z D=7168, 2025-05-07T20:33:00.5647537Z scale_ub=1200.0, 2025-05-07T20:33:00.5647752Z contiguous=True, 2025-05-07T20:33:00.5647968Z compiled=False, 2025-05-07T20:33:00.5648164Z ) 2025-05-07T20:33:00.5648624Z self = 2025-05-07T20:33:00.5649130Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.5649407Z 2025-05-07T20:33:00.5649491Z @given( 2025-05-07T20:33:00.5649722Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5650027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5650332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5650658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5650972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5651252Z ) 2025-05-07T20:33:00.5651593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5652027Z def test_silu_mul_quant( 2025-05-07T20:33:00.5652258Z self, 2025-05-07T20:33:00.5652442Z T: int, 2025-05-07T20:33:00.5652626Z D: int, 2025-05-07T20:33:00.5652856Z scale_ub: Optional[float], 2025-05-07T20:33:00.5653181Z contiguous: bool, 2025-05-07T20:33:00.5653403Z compiled: bool, 2025-05-07T20:33:00.5653614Z ) -> None: 2025-05-07T20:33:00.5653818Z torch.manual_seed(2025) 2025-05-07T20:33:00.5654059Z 2025-05-07T20:33:00.5654322Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5656369Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
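Every one of these OOM reports ends with the same allocator hint. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before the process makes its first CUDA allocation; the sketch below (illustrative only, not from this repo) sets it in-process, though exporting the variable in the CI job environment is the more reliable route:

    import os

    # Must take effect before the first CUDA allocation in this process;
    # in CI, prefer exporting this in the job environment instead.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # importing and allocating only afterwards keeps this safe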
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5658298Z 2025-05-07T20:33:00.5658414Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7467265Z 2025-05-07T20:33:00.7467697Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7468343Z self=, 2025-05-07T20:33:00.7468897Z T=128, 2025-05-07T20:33:00.7469161Z D=5120, 2025-05-07T20:33:00.7469426Z scale_ub=1200.0, 2025-05-07T20:33:00.7469727Z contiguous=False, 2025-05-07T20:33:00.7469998Z compiled=False, 2025-05-07T20:33:00.7470218Z ) 2025-05-07T20:33:00.7470533Z self = 2025-05-07T20:33:00.7471048Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.7471335Z 2025-05-07T20:33:00.7471416Z @given( 2025-05-07T20:33:00.7471717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7472037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7472360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7472702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7473029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7473324Z ) 2025-05-07T20:33:00.7473686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7474139Z def test_silu_mul_quant( 2025-05-07T20:33:00.7474401Z self, 2025-05-07T20:33:00.7474607Z T: int, 2025-05-07T20:33:00.7474845Z D: int, 2025-05-07T20:33:00.7475073Z scale_ub: Optional[float], 2025-05-07T20:33:00.7475340Z contiguous: bool, 2025-05-07T20:33:00.7475585Z compiled: bool, 2025-05-07T20:33:00.7475820Z ) -> None: 2025-05-07T20:33:00.7476028Z torch.manual_seed(2025) 2025-05-07T20:33:00.7476280Z 2025-05-07T20:33:00.7476559Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7476900Z 2025-05-07T20:33:00.7477480Z x_sign = torch.sign(x) 2025-05-07T20:33:00.7477785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.7478107Z x = x_sign * x_clamp 2025-05-07T20:33:00.7478343Z x0 = x[:, :D] 2025-05-07T20:33:00.7478565Z x1 = x[:, D:] 2025-05-07T20:33:00.7478777Z 2025-05-07T20:33:00.7478958Z if contiguous: 2025-05-07T20:33:00.7479196Z x0 = x0.contiguous() 2025-05-07T20:33:00.7479460Z x1 = x1.contiguous() 2025-05-07T20:33:00.7479695Z 2025-05-07T20:33:00.7479888Z if scale_ub is not None: 2025-05-07T20:33:00.7480164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.7480497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.7480804Z ) 2025-05-07T20:33:00.7481003Z else: 2025-05-07T20:33:00.7481211Z scale_ub_tensor = None 2025-05-07T20:33:00.7481472Z 2025-05-07T20:33:00.7481709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.7482021Z op = silu_mul_quant 2025-05-07T20:33:00.7482359Z if compiled: 2025-05-07T20:33:00.7482608Z op = torch.compile(op) 2025-05-07T20:33:00.7482904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7483168Z 2025-05-07T20:33:00.7483360Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.7483522Z 2025-05-07T20:33:00.7483632Z moe/activation_test.py:117: 2025-05-07T20:33:00.7483923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7484253Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.7484539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7485216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.7485993Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.7486542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.7487236Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.7487924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.7488454Z kernel = self.compile( 2025-05-07T20:33:00.7489022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.7489665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.7490063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7490298Z 2025-05-07T20:33:00.7490505Z self = 2025-05-07T20:33:00.7491583Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.7493008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d72fcae0>} 2025-05-07T20:33:00.7494321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.7495333Z context = 2025-05-07T20:33:00.7495624Z 2025-05-07T20:33:00.7495790Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.7496312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.7496779Z module_map=module_map) 2025-05-07T20:33:00.7497153Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.7497591Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.7497851Z E ^ 2025-05-07T20:33:00.7498315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.7498785Z 2025-05-07T20:33:00.7499215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.7499721Z 2025-05-07T20:33:00.7499833Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7500236Z self=, 2025-05-07T20:33:00.7500644Z T=2048, 2025-05-07T20:33:00.7500836Z D=7168, 2025-05-07T20:33:00.7501021Z scale_ub=None, 2025-05-07T20:33:00.7501240Z contiguous=False, 2025-05-07T20:33:00.7501469Z compiled=False, 2025-05-07T20:33:00.7501670Z ) 2025-05-07T20:33:00.7501992Z self = 2025-05-07T20:33:00.7502487Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.7502802Z 2025-05-07T20:33:00.7502890Z @given( 2025-05-07T20:33:00.7503114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7503427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7503736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7504059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7504390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7504676Z ) 2025-05-07T20:33:00.7505018Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7505467Z def test_silu_mul_quant( 2025-05-07T20:33:00.7505709Z self, 2025-05-07T20:33:00.7505909Z T: int, 2025-05-07T20:33:00.7506146Z D: int, 2025-05-07T20:33:00.7506367Z scale_ub: Optional[float], 2025-05-07T20:33:00.7506636Z contiguous: bool, 2025-05-07T20:33:00.7506876Z compiled: bool, 2025-05-07T20:33:00.7507103Z ) -> None: 2025-05-07T20:33:00.7507319Z torch.manual_seed(2025) 2025-05-07T20:33:00.7507635Z 2025-05-07T20:33:00.7507913Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7510061Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
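Interleaved with the OOMs is a second, distinct failure mode: the Triton CompilationError above. fp8e4nv is Triton's name for the FP8 E4M3 format, whose lowering requires compute capability sm_89 (Ada) or newer; on older parts such as the A10G (sm_86) Triton exposes only 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError lists. A minimal capability gate, sketched under that assumption (supports_fp8e4nv is a hypothetical helper, not part of the test suite):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on sm_89 and newer; an
        # sm_86 GPU reports (8, 6) and so takes the False branch.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)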
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7511963Z 2025-05-07T20:33:00.7512090Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7512299Z 2025-05-07T20:33:00.7512415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7512826Z self=, 2025-05-07T20:33:00.7513251Z T=128, 2025-05-07T20:33:00.7513444Z D=7168, 2025-05-07T20:33:00.7513634Z scale_ub=1200.0, 2025-05-07T20:33:00.7513864Z contiguous=True, 2025-05-07T20:33:00.7514090Z compiled=True, 2025-05-07T20:33:00.7514288Z ) 2025-05-07T20:33:00.7514610Z self = 2025-05-07T20:33:00.7515095Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.7515357Z 2025-05-07T20:33:00.7515435Z @given( 2025-05-07T20:33:00.7515664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7515979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7516292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7516614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7517021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7517311Z ) 2025-05-07T20:33:00.7517653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7518099Z def test_silu_mul_quant( 2025-05-07T20:33:00.7518345Z self, 2025-05-07T20:33:00.7518554Z T: int, 2025-05-07T20:33:00.7518758Z D: int, 2025-05-07T20:33:00.7518970Z scale_ub: Optional[float], 2025-05-07T20:33:00.7519244Z contiguous: bool, 2025-05-07T20:33:00.7519493Z compiled: bool, 2025-05-07T20:33:00.7519709Z ) -> None: 2025-05-07T20:33:00.7519928Z torch.manual_seed(2025) 2025-05-07T20:33:00.7520172Z 2025-05-07T20:33:00.7520442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7520785Z 2025-05-07T20:33:00.7520977Z x_sign = torch.sign(x) 2025-05-07T20:33:00.7521272Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.7521584Z x = x_sign * x_clamp 2025-05-07T20:33:00.7521883Z x0 = x[:, :D] 2025-05-07T20:33:00.7522102Z x1 = x[:, D:] 2025-05-07T20:33:00.7522302Z 2025-05-07T20:33:00.7522488Z if contiguous: 2025-05-07T20:33:00.7522722Z x0 = x0.contiguous() 2025-05-07T20:33:00.7522981Z x1 = x1.contiguous() 2025-05-07T20:33:00.7523219Z 2025-05-07T20:33:00.7523412Z if scale_ub is not None: 2025-05-07T20:33:00.7523678Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.7524013Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.7524329Z ) 2025-05-07T20:33:00.7524515Z else: 2025-05-07T20:33:00.7524728Z scale_ub_tensor = None 2025-05-07T20:33:00.7524979Z 2025-05-07T20:33:00.7525209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.7525562Z op = silu_mul_quant 2025-05-07T20:33:00.7525820Z if compiled: 2025-05-07T20:33:00.7526079Z op = torch.compile(op) 2025-05-07T20:33:00.7526373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7526648Z 2025-05-07T20:33:00.7526845Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.7527007Z 2025-05-07T20:33:00.7527103Z moe/activation_test.py:117: 2025-05-07T20:33:00.7527398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7527731Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.7528003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7528568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.7529138Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.7529801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.7530477Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.7531026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.7531731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.7532395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.7532919Z kernel = self.compile( 2025-05-07T20:33:00.7533469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.7534146Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.7534539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7534773Z 2025-05-07T20:33:00.7534981Z self = 2025-05-07T20:33:00.7536141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.7537501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7180040>} 2025-05-07T20:33:00.7538829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.7539835Z context = 2025-05-07T20:33:00.7540395Z 2025-05-07T20:33:00.7540565Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.7541086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.7541565Z module_map=module_map) 2025-05-07T20:33:00.7541936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.7542424Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.7542685Z E ^ 2025-05-07T20:33:00.7543147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.7543604Z 2025-05-07T20:33:00.7544022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.0865526Z 2025-05-07T20:33:01.0865859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.0866326Z self=, 2025-05-07T20:33:01.0866911Z T=128, 2025-05-07T20:33:01.0867199Z D=7168, 2025-05-07T20:33:01.0867543Z scale_ub=1200.0, 2025-05-07T20:33:01.0868128Z contiguous=True, 2025-05-07T20:33:01.0868358Z compiled=False, 2025-05-07T20:33:01.0868567Z ) 2025-05-07T20:33:01.0868906Z self = 2025-05-07T20:33:01.0869410Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.0869694Z 2025-05-07T20:33:01.0869785Z @given( 2025-05-07T20:33:01.0870014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0870332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.0870650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.0870984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.0871321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.0871614Z ) 2025-05-07T20:33:01.0871963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.0872431Z def test_silu_mul_quant( 2025-05-07T20:33:01.0872686Z self, 2025-05-07T20:33:01.0872890Z T: int, 2025-05-07T20:33:01.0873091Z D: int, 2025-05-07T20:33:01.0873318Z scale_ub: Optional[float], 2025-05-07T20:33:01.0873604Z contiguous: bool, 2025-05-07T20:33:01.0873847Z compiled: bool, 2025-05-07T20:33:01.0874088Z ) -> None: 2025-05-07T20:33:01.0874313Z torch.manual_seed(2025) 2025-05-07T20:33:01.0874555Z 2025-05-07T20:33:01.0874836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0875185Z 2025-05-07T20:33:01.0875376Z x_sign = torch.sign(x) 2025-05-07T20:33:01.0875680Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.0877939Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0879806Z 2025-05-07T20:33:01.0879926Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.0880138Z 2025-05-07T20:33:01.0880250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.0880663Z self=, 2025-05-07T20:33:01.0881077Z T=128, 2025-05-07T20:33:01.0881271Z D=5120, 2025-05-07T20:33:01.0881461Z scale_ub=1200.0, 2025-05-07T20:33:01.0881691Z contiguous=True, 2025-05-07T20:33:01.0881918Z compiled=True, 2025-05-07T20:33:01.0882121Z ) 2025-05-07T20:33:01.0882445Z self = 2025-05-07T20:33:01.0882935Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.0883215Z 2025-05-07T20:33:01.0883301Z @given( 2025-05-07T20:33:01.0883535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0883853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.0884241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.0884567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.0884896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.0885185Z ) 2025-05-07T20:33:01.0885530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.0885978Z def test_silu_mul_quant( 2025-05-07T20:33:01.0886230Z self, 2025-05-07T20:33:01.0886424Z T: int, 2025-05-07T20:33:01.0886627Z D: int, 2025-05-07T20:33:01.0886850Z scale_ub: Optional[float], 2025-05-07T20:33:01.0887114Z contiguous: bool, 2025-05-07T20:33:01.0887363Z compiled: bool, 2025-05-07T20:33:01.0887641Z ) -> None: 2025-05-07T20:33:01.0887861Z torch.manual_seed(2025) 2025-05-07T20:33:01.0888100Z 2025-05-07T20:33:01.0888381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0888738Z 2025-05-07T20:33:01.0888927Z x_sign = torch.sign(x) 2025-05-07T20:33:01.0889225Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.0891220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0893168Z 2025-05-07T20:33:01.0893295Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.0893503Z 2025-05-07T20:33:01.0893608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.0894027Z self=, 2025-05-07T20:33:01.0894432Z T=128, 2025-05-07T20:33:01.0894625Z D=7168, 2025-05-07T20:33:01.0894812Z scale_ub=None, 2025-05-07T20:33:01.0895028Z contiguous=True, 2025-05-07T20:33:01.0895256Z compiled=True, 2025-05-07T20:33:01.0895454Z ) 2025-05-07T20:33:01.0895777Z self = 2025-05-07T20:33:01.0896262Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.0896529Z 2025-05-07T20:33:01.0896608Z @given( 2025-05-07T20:33:01.0896838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0897155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.0897453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.0897792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.0898204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.0898495Z ) 2025-05-07T20:33:01.0898844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.0899293Z def test_silu_mul_quant( 2025-05-07T20:33:01.0899539Z self, 2025-05-07T20:33:01.0899728Z T: int, 2025-05-07T20:33:01.0899928Z D: int, 2025-05-07T20:33:01.0900150Z scale_ub: Optional[float], 2025-05-07T20:33:01.0900414Z contiguous: bool, 2025-05-07T20:33:01.0900654Z compiled: bool, 2025-05-07T20:33:01.0900878Z ) -> None: 2025-05-07T20:33:01.0901087Z torch.manual_seed(2025) 2025-05-07T20:33:01.0901330Z 2025-05-07T20:33:01.0901602Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0903615Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
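By this point the failure site has crept from activation_test.py:92 (the initial randn) up to :95 (the clamp), and the reported free memory has fallen from 26.44 MiB to 4.44 MiB: allocations from earlier Hypothesis examples are still cached when the next example begins. A per-example teardown along these lines would keep the examples independent (release_cuda_memory is a hypothetical helper, shown only as a sketch):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Hypothetical cleanup between Hypothesis examples: drop dead Python
        # references, then hand the allocator's cached blocks back so the
        # next example starts from a clean slate.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()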
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0905604Z 2025-05-07T20:33:01.0905728Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.0905937Z 2025-05-07T20:33:01.0909060Z FAILED 2025-05-07T20:33:01.0909192Z 2025-05-07T20:33:01.0909336Z =================================== FAILURES =================================== 2025-05-07T20:33:01.0909945Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:01.0910561Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:01.0911473Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:01.0912232Z | yield 2025-05-07T20:33:01.0912816Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:01.0913525Z | self._callTestMethod(testMethod) 2025-05-07T20:33:01.0914074Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:01.0914819Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:01.0915561Z | if method() is not None: 2025-05-07T20:33:01.0915902Z | ~~~~~~^^ 2025-05-07T20:33:01.0929656Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:01.0930787Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0931232Z | ^^^^^^^ 2025-05-07T20:33:01.0932115Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:01.0933036Z | raise the_error_hypothesis_found 2025-05-07T20:33:01.0933643Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:01.0934240Z +-+---------------- 1 ---------------- 2025-05-07T20:33:01.0934662Z | Traceback (most recent call last): 2025-05-07T20:33:01.0935682Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:01.0936813Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0939989Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0943194Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.0943814Z | self=, 2025-05-07T20:33:01.0944394Z | T=2048, 2025-05-07T20:33:01.0944714Z | D=5120, # or any other generated value 2025-05-07T20:33:01.0945194Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:01.0945708Z | contiguous=True, # or any other generated value 2025-05-07T20:33:01.0946222Z | compiled=False, # or any other generated value 2025-05-07T20:33:01.0946655Z | ) 2025-05-07T20:33:01.0946908Z | 2025-05-07T20:33:01.0947772Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:01.0948678Z +---------------- 2 ---------------- 2025-05-07T20:33:01.0949199Z | Traceback (most recent call last): 2025-05-07T20:33:01.0950223Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:01.0951357Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0954354Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0957100Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.0957741Z | self=, 2025-05-07T20:33:01.0958328Z | T=128, 2025-05-07T20:33:01.0958608Z | D=7168, 2025-05-07T20:33:01.0958898Z | scale_ub=None, 2025-05-07T20:33:01.0959232Z | contiguous=True, 2025-05-07T20:33:01.0959558Z | compiled=True, 2025-05-07T20:33:01.0959859Z | ) 2025-05-07T20:33:01.0960105Z | 2025-05-07T20:33:01.0960768Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:01.0961365Z +---------------- 3 ---------------- 2025-05-07T20:33:01.0961652Z | Traceback (most recent call last): 2025-05-07T20:33:01.0962359Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:01.0963121Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0965125Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
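Each falsifying example above comes with its own @reproduce_failure recipe. Hypothesis intends these to be stacked, temporarily, on top of the test's existing decorators; the payload encodes all five generated arguments, so the @given strategies must stay exactly as they appear in the log. A usage sketch for failure 1 (test_repro is an illustrative name; in practice the decorator goes on test_silu_mul_quant itself):

    from hypothesis import given, reproduce_failure, strategies as st

    # Replays exactly the T=2048, D=5120 case from failure 1; remove the
    # decorator again once the bug is understood.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_repro(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body identical to test_silu_mul_quant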
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0967145Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.0967583Z | self=, 2025-05-07T20:33:01.0967983Z | T=128, 2025-05-07T20:33:01.0968177Z | D=5120, 2025-05-07T20:33:01.0968546Z | scale_ub=1200.0, 2025-05-07T20:33:01.0968791Z | contiguous=True, 2025-05-07T20:33:01.0969028Z | compiled=True, 2025-05-07T20:33:01.0969249Z | ) 2025-05-07T20:33:01.0969418Z | 2025-05-07T20:33:01.0969927Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:01.0970517Z +---------------- 4 ---------------- 2025-05-07T20:33:01.0970799Z | Traceback (most recent call last): 2025-05-07T20:33:01.0971501Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:01.0972275Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.0972558Z | ~~~~~~^^ 2025-05-07T20:33:01.0973338Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:01.0974334Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.0975594Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:01.0976740Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.0977129Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:01.0977503Z | a, 2025-05-07T20:33:01.0977789Z | ^^ 2025-05-07T20:33:01.0978074Z | ...<23 lines>... 
2025-05-07T20:33:01.0978419Z | USE_INT64=use_int64, 2025-05-07T20:33:01.0978705Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.0979018Z | ) 2025-05-07T20:33:01.0979266Z | ^ 2025-05-07T20:33:01.0980004Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:01.0981155Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.0981819Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.1004141Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:01.1004958Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1005422Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.1006070Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:01.1006781Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1007172Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.1007806Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:01.1008394Z | fn() 2025-05-07T20:33:01.1008589Z | ~~^^ 2025-05-07T20:33:01.1009169Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:01.1009809Z | self.fn.run( 2025-05-07T20:33:01.1010027Z | ~~~~~~~~~~~^ 2025-05-07T20:33:01.1010235Z | *args, 2025-05-07T20:33:01.1010432Z | ^^^^^^ 2025-05-07T20:33:01.1010638Z | **current, 2025-05-07T20:33:01.1010856Z | ^^^^^^^^^^ 2025-05-07T20:33:01.1011065Z | ) 2025-05-07T20:33:01.1011246Z | ^ 2025-05-07T20:33:01.1011739Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:01.1012309Z | kernel = self.compile( 2025-05-07T20:33:01.1012551Z | src, 2025-05-07T20:33:01.1012892Z | target=target, 2025-05-07T20:33:01.1013148Z | options=options.__dict__, 2025-05-07T20:33:01.1013412Z | ) 2025-05-07T20:33:01.1013966Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:01.1014669Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1015360Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:01.1016147Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1016617Z | module_map=module_map) 2025-05-07T20:33:01.1016972Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1017312Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1017570Z | ^ 2025-05-07T20:33:01.1018027Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1018673Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.1019063Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:01.1019576Z | self=, 2025-05-07T20:33:01.1020000Z | T=1, # or any other generated value 2025-05-07T20:33:01.1020298Z | D=5120, # or any other generated value 2025-05-07T20:33:01.1020623Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:01.1020972Z | contiguous=True, # or any other generated value 2025-05-07T20:33:01.1021319Z | compiled=True, # or any other generated value 2025-05-07T20:33:01.1021615Z | ) 2025-05-07T20:33:01.1021880Z | 2025-05-07T20:33:01.1022395Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:01.1022988Z +------------------------------------ 2025-05-07T20:33:01.1023343Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:01.1023720Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1024120Z self=, 2025-05-07T20:33:01.1024511Z T=1, 2025-05-07T20:33:01.1024691Z D=5120, 2025-05-07T20:33:01.1024871Z scale_ub=None, 2025-05-07T20:33:01.1025078Z contiguous=True, 2025-05-07T20:33:01.1025292Z compiled=True, 2025-05-07T20:33:01.1025485Z ) 2025-05-07T20:33:01.1025797Z self = 2025-05-07T20:33:01.1026268Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1026531Z 2025-05-07T20:33:01.1026612Z @given( 2025-05-07T20:33:01.1026826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1027133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1027528Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1027854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1028173Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1028449Z ) 2025-05-07T20:33:01.1028788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1029233Z def test_silu_mul_quant( 2025-05-07T20:33:01.1029470Z self, 2025-05-07T20:33:01.1029658Z T: int, 2025-05-07T20:33:01.1029843Z D: int, 2025-05-07T20:33:01.1030055Z scale_ub: Optional[float], 2025-05-07T20:33:01.1030322Z contiguous: bool, 2025-05-07T20:33:01.1030552Z compiled: bool, 2025-05-07T20:33:01.1030769Z ) -> None: 2025-05-07T20:33:01.1030980Z torch.manual_seed(2025) 2025-05-07T20:33:01.1031211Z 2025-05-07T20:33:01.1031478Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1031909Z 2025-05-07T20:33:01.1032094Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1032381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1032679Z x = x_sign * x_clamp 2025-05-07T20:33:01.1032903Z x0 = x[:, :D] 2025-05-07T20:33:01.1033109Z x1 = x[:, D:] 2025-05-07T20:33:01.1033305Z 2025-05-07T20:33:01.1033481Z if contiguous: 2025-05-07T20:33:01.1033699Z x0 = x0.contiguous() 2025-05-07T20:33:01.1033947Z x1 = x1.contiguous() 2025-05-07T20:33:01.1034177Z 2025-05-07T20:33:01.1034353Z if scale_ub is not None: 2025-05-07T20:33:01.1034620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1034948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1035241Z ) 2025-05-07T20:33:01.1035427Z else: 2025-05-07T20:33:01.1035629Z scale_ub_tensor = None 2025-05-07T20:33:01.1035863Z 2025-05-07T20:33:01.1036095Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1036445Z op = silu_mul_quant 2025-05-07T20:33:01.1036681Z if compiled: 2025-05-07T20:33:01.1036919Z op = torch.compile(op) 2025-05-07T20:33:01.1037205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1037462Z 2025-05-07T20:33:01.1037646Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:01.1037921Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1038201Z 2025-05-07T20:33:01.1038421Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1038942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1039228Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1039524Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1039927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1040525Z 2025-05-07T20:33:01.1040798Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.1041067Z 2025-05-07T20:33:01.1041197Z moe/activation_test.py:126: 2025-05-07T20:33:01.1041575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1042001Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1042364Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1043138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1043877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1044411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1045085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1045791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1046509Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1047241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1047871Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1048473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1048981Z fn() 2025-05-07T20:33:01.1049495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1050090Z self.fn.run( 2025-05-07T20:33:01.1050555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1051070Z kernel = self.compile( 2025-05-07T20:33:01.1051826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1052471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1052858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1053080Z 2025-05-07T20:33:01.1053281Z self = 2025-05-07T20:33:01.1054340Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1055713Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f37b445b6a0>} 2025-05-07T20:33:01.1057037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1058140Z context = 2025-05-07T20:33:01.1058433Z 2025-05-07T20:33:01.1058601Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1059122Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1059580Z module_map=module_map) 2025-05-07T20:33:01.1059931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1060281Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1060538Z E ^ 2025-05-07T20:33:01.1060985Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1061510Z 2025-05-07T20:33:01.1061933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1062439Z 2025-05-07T20:33:01.1062539Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1062938Z self=, 2025-05-07T20:33:01.1063320Z T=2048, 2025-05-07T20:33:01.1063509Z D=5120, 2025-05-07T20:33:01.1063694Z scale_ub=1200.0, 2025-05-07T20:33:01.1063905Z contiguous=True, 2025-05-07T20:33:01.1064119Z compiled=False, 2025-05-07T20:33:01.1064318Z ) 2025-05-07T20:33:01.1064622Z self = 2025-05-07T20:33:01.1065102Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1065371Z 2025-05-07T20:33:01.1065444Z @given( 2025-05-07T20:33:01.1065664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1065967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1066265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1066591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1066908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1067187Z ) 2025-05-07T20:33:01.1067655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1068077Z def test_silu_mul_quant( 2025-05-07T20:33:01.1068314Z self, 2025-05-07T20:33:01.1068503Z T: int, 2025-05-07T20:33:01.1068688Z D: int, 2025-05-07T20:33:01.1068902Z scale_ub: Optional[float], 2025-05-07T20:33:01.1069167Z contiguous: bool, 2025-05-07T20:33:01.1069404Z compiled: bool, 2025-05-07T20:33:01.1069613Z ) -> None: 2025-05-07T20:33:01.1069820Z torch.manual_seed(2025) 2025-05-07T20:33:01.1070058Z 2025-05-07T20:33:01.1070321Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1070658Z 2025-05-07T20:33:01.1070844Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1071210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1071520Z x = x_sign * x_clamp 2025-05-07T20:33:01.1071752Z x0 = x[:, :D] 2025-05-07T20:33:01.1071954Z x1 = x[:, D:] 2025-05-07T20:33:01.1072154Z 2025-05-07T20:33:01.1072329Z if contiguous: 2025-05-07T20:33:01.1072551Z x0 = x0.contiguous() 2025-05-07T20:33:01.1072798Z x1 = x1.contiguous() 2025-05-07T20:33:01.1073027Z 2025-05-07T20:33:01.1073206Z if scale_ub is not None: 2025-05-07T20:33:01.1073468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1073793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1074091Z ) 2025-05-07T20:33:01.1074268Z else: 2025-05-07T20:33:01.1074471Z scale_ub_tensor = None 2025-05-07T20:33:01.1074718Z 2025-05-07T20:33:01.1074933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1075234Z op = silu_mul_quant 2025-05-07T20:33:01.1075481Z if compiled: 
2025-05-07T20:33:01.1075713Z op = torch.compile(op) 2025-05-07T20:33:01.1076047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1076311Z 2025-05-07T20:33:01.1076495Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1076660Z 2025-05-07T20:33:01.1076754Z moe/activation_test.py:117: 2025-05-07T20:33:01.1077038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1077361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1077632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1078332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1079032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1079599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1080272Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1080944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1081478Z kernel = self.compile( 2025-05-07T20:33:01.1082048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1082687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1083074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1083296Z 2025-05-07T20:33:01.1083497Z self = 2025-05-07T20:33:01.1084556Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1085910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b40c1f80>} 2025-05-07T20:33:01.1087233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1088236Z context = 2025-05-07T20:33:01.1088516Z 2025-05-07T20:33:01.1088676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1089196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1089662Z module_map=module_map) 2025-05-07T20:33:01.1090023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1090368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1090738Z E ^ 2025-05-07T20:33:01.1091200Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1091650Z 2025-05-07T20:33:01.1092076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1092582Z 2025-05-07T20:33:01.1092679Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1093081Z self=, 2025-05-07T20:33:01.1093480Z T=2048, 2025-05-07T20:33:01.1093657Z D=5120, 2025-05-07T20:33:01.1093840Z scale_ub=1200.0, 2025-05-07T20:33:01.1094051Z contiguous=True, 2025-05-07T20:33:01.1094257Z compiled=True, 2025-05-07T20:33:01.1094450Z ) 2025-05-07T20:33:01.1094763Z self = 2025-05-07T20:33:01.1095242Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1095517Z 2025-05-07T20:33:01.1095639Z @given( 2025-05-07T20:33:01.1095862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1096161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1096459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1096781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1097095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1097367Z ) 2025-05-07T20:33:01.1097707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1098153Z def test_silu_mul_quant( 2025-05-07T20:33:01.1098386Z self, 2025-05-07T20:33:01.1098574Z T: int, 2025-05-07T20:33:01.1098762Z D: int, 2025-05-07T20:33:01.1099016Z scale_ub: Optional[float], 2025-05-07T20:33:01.1099281Z contiguous: bool, 2025-05-07T20:33:01.1099514Z compiled: bool, 2025-05-07T20:33:01.1099725Z ) -> None: 2025-05-07T20:33:01.1099947Z torch.manual_seed(2025) 2025-05-07T20:33:01.1100188Z 2025-05-07T20:33:01.1100446Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1100780Z 2025-05-07T20:33:01.1100966Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1101255Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1101554Z x = x_sign * x_clamp 2025-05-07T20:33:01.1101790Z x0 = x[:, :D] 2025-05-07T20:33:01.1102005Z x1 = x[:, D:] 2025-05-07T20:33:01.1102302Z 2025-05-07T20:33:01.1102566Z if contiguous: 2025-05-07T20:33:01.1102868Z x0 = x0.contiguous() 2025-05-07T20:33:01.1103421Z x1 = x1.contiguous() 2025-05-07T20:33:01.1103738Z 2025-05-07T20:33:01.1103981Z if scale_ub is not None: 2025-05-07T20:33:01.1120006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1120474Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1120899Z ) 2025-05-07T20:33:01.1121098Z else: 2025-05-07T20:33:01.1121309Z scale_ub_tensor = None 2025-05-07T20:33:01.1121558Z 2025-05-07T20:33:01.1121782Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1122093Z op = silu_mul_quant 2025-05-07T20:33:01.1122338Z if compiled: 2025-05-07T20:33:01.1122577Z op = torch.compile(op) 2025-05-07T20:33:01.1122866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1123133Z 2025-05-07T20:33:01.1123313Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1123589Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1123866Z 2025-05-07T20:33:01.1124088Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1124416Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1124701Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1125196Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1125547Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1125848Z 2025-05-07T20:33:01.1126040Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.1126229Z 2025-05-07T20:33:01.1126325Z moe/activation_test.py:126: 2025-05-07T20:33:01.1126617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1126950Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1127263Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1128038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1128770Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1129304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1129988Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1130740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1131454Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1132178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1132799Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1133402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1133904Z fn() 2025-05-07T20:33:01.1134409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1135052Z self.fn.run( 2025-05-07T20:33:01.1135515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1136034Z kernel = self.compile( 2025-05-07T20:33:01.1136582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1137350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1137740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1137962Z 2025-05-07T20:33:01.1138165Z self = 2025-05-07T20:33:01.1139236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1141131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b41191c0>} 2025-05-07T20:33:01.1142462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1143477Z context = 2025-05-07T20:33:01.1143762Z 2025-05-07T20:33:01.1143924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1144437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1144900Z module_map=module_map) 2025-05-07T20:33:01.1145260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1145608Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1145870Z E ^ 2025-05-07T20:33:01.1146570Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1147036Z 2025-05-07T20:33:01.1147577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1148285Z 2025-05-07T20:33:01.1148420Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1148960Z self=, 2025-05-07T20:33:01.1149478Z T=16384, 2025-05-07T20:33:01.1149657Z D=7168, 2025-05-07T20:33:01.1149843Z scale_ub=1200.0, 2025-05-07T20:33:01.1150059Z contiguous=False, 2025-05-07T20:33:01.1150272Z compiled=False, 2025-05-07T20:33:01.1150473Z ) 2025-05-07T20:33:01.1150784Z self = 2025-05-07T20:33:01.1151264Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1151555Z 2025-05-07T20:33:01.1151629Z @given( 2025-05-07T20:33:01.1151856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1152153Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1152571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1152892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1153208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1153476Z ) 2025-05-07T20:33:01.1153812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1154238Z def test_silu_mul_quant( 2025-05-07T20:33:01.1154464Z self, 2025-05-07T20:33:01.1154653Z T: int, 2025-05-07T20:33:01.1154838Z D: int, 2025-05-07T20:33:01.1155042Z scale_ub: Optional[float], 2025-05-07T20:33:01.1155303Z contiguous: bool, 2025-05-07T20:33:01.1155534Z compiled: bool, 2025-05-07T20:33:01.1155820Z ) -> None: 2025-05-07T20:33:01.1156029Z torch.manual_seed(2025) 2025-05-07T20:33:01.1156261Z 2025-05-07T20:33:01.1156525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1156858Z 2025-05-07T20:33:01.1157045Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1157320Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1157621Z x = x_sign * x_clamp 2025-05-07T20:33:01.1157852Z x0 = x[:, :D] 2025-05-07T20:33:01.1158056Z x1 = x[:, D:] 2025-05-07T20:33:01.1158249Z 2025-05-07T20:33:01.1158422Z if contiguous: 2025-05-07T20:33:01.1158641Z x0 = x0.contiguous() 2025-05-07T20:33:01.1158882Z x1 = x1.contiguous() 2025-05-07T20:33:01.1159114Z 2025-05-07T20:33:01.1159299Z if scale_ub is not None: 2025-05-07T20:33:01.1159560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1159889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1160191Z ) 2025-05-07T20:33:01.1160372Z else: 2025-05-07T20:33:01.1160575Z scale_ub_tensor = None 2025-05-07T20:33:01.1160825Z 2025-05-07T20:33:01.1161041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1161346Z op = silu_mul_quant 2025-05-07T20:33:01.1161590Z if compiled: 2025-05-07T20:33:01.1161822Z op = torch.compile(op) 2025-05-07T20:33:01.1162110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1162374Z 2025-05-07T20:33:01.1162555Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1162712Z 2025-05-07T20:33:01.1162803Z moe/activation_test.py:117: 2025-05-07T20:33:01.1163090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1163409Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1163672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1164370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.1165051Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1165660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1166347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1166999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1167519Z kernel = self.compile( 2025-05-07T20:33:01.1168053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1168717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1169105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1169325Z 2025-05-07T20:33:01.1169536Z self = 2025-05-07T20:33:01.1170600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1172033Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b42aa980>} 2025-05-07T20:33:01.1173389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1174391Z context = 2025-05-07T20:33:01.1174670Z 2025-05-07T20:33:01.1174835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1175382Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1175847Z module_map=module_map) 2025-05-07T20:33:01.1176207Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1176543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1176791Z E ^ 2025-05-07T20:33:01.1177248Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1177691Z 2025-05-07T20:33:01.1178113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1178613Z 2025-05-07T20:33:01.1178709Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1179113Z self=, 2025-05-07T20:33:01.1179508Z T=1, 2025-05-07T20:33:01.1179679Z D=7168, 2025-05-07T20:33:01.1179872Z scale_ub=None, 2025-05-07T20:33:01.1180074Z contiguous=True, 2025-05-07T20:33:01.1180282Z compiled=True, 2025-05-07T20:33:01.1180477Z ) 2025-05-07T20:33:01.1180794Z self = 2025-05-07T20:33:01.1181267Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1181518Z 2025-05-07T20:33:01.1181591Z @given( 2025-05-07T20:33:01.1181810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1182112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1182402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1182723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1183041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1183311Z ) 2025-05-07T20:33:01.1183649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1184088Z def test_silu_mul_quant( 2025-05-07T20:33:01.1184317Z self, 2025-05-07T20:33:01.1184499Z T: int, 2025-05-07T20:33:01.1184690Z D: int, 2025-05-07T20:33:01.1184993Z scale_ub: Optional[float], 2025-05-07T20:33:01.1185255Z contiguous: bool, 2025-05-07T20:33:01.1185487Z compiled: bool, 2025-05-07T20:33:01.1185699Z ) -> None: 2025-05-07T20:33:01.1185897Z torch.manual_seed(2025) 2025-05-07T20:33:01.1186132Z 2025-05-07T20:33:01.1186397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1186719Z 2025-05-07T20:33:01.1186902Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1187181Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1187573Z x = x_sign * x_clamp 2025-05-07T20:33:01.1187807Z x0 = x[:, :D] 2025-05-07T20:33:01.1188022Z x1 = x[:, D:] 2025-05-07T20:33:01.1188217Z 2025-05-07T20:33:01.1188393Z if contiguous: 2025-05-07T20:33:01.1188620Z x0 = x0.contiguous() 2025-05-07T20:33:01.1188865Z x1 = x1.contiguous() 2025-05-07T20:33:01.1189098Z 2025-05-07T20:33:01.1189288Z if scale_ub is not None: 2025-05-07T20:33:01.1189546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1189931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1190225Z ) 2025-05-07T20:33:01.1190409Z else: 2025-05-07T20:33:01.1190605Z scale_ub_tensor = None 2025-05-07T20:33:01.1190852Z 2025-05-07T20:33:01.1191076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1191376Z op = silu_mul_quant 2025-05-07T20:33:01.1191623Z if compiled: 2025-05-07T20:33:01.1191862Z op = torch.compile(op) 2025-05-07T20:33:01.1192144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1192409Z 2025-05-07T20:33:01.1192604Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1192872Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1193204Z 2025-05-07T20:33:01.1193438Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1193768Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1194055Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1194358Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1194708Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1195002Z 2025-05-07T20:33:01.1195195Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1195383Z 2025-05-07T20:33:01.1195483Z moe/activation_test.py:126: 2025-05-07T20:33:01.1195768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1196093Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1196414Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1197180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1197923Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1198482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1199155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1199833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1200542Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1201259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1201886Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1202474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1202978Z fn() 2025-05-07T20:33:01.1203582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1204168Z self.fn.run( 2025-05-07T20:33:01.1204628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1205150Z kernel = self.compile( 2025-05-07T20:33:01.1205686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1206316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1206701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1206921Z 2025-05-07T20:33:01.1207128Z self = 2025-05-07T20:33:01.1208194Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1209536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76520>} 2025-05-07T20:33:01.1210899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1211903Z context = 2025-05-07T20:33:01.1212184Z 2025-05-07T20:33:01.1212351Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1212852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1213351Z module_map=module_map) 2025-05-07T20:33:01.1213705Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1214059Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1214313Z E ^ 2025-05-07T20:33:01.1214774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1215217Z 2025-05-07T20:33:01.1215656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1216154Z 2025-05-07T20:33:01.1216257Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1216650Z self=, 2025-05-07T20:33:01.1217040Z T=4096, 2025-05-07T20:33:01.1217227Z D=5120, 2025-05-07T20:33:01.1217407Z scale_ub=None, 2025-05-07T20:33:01.1217624Z contiguous=False, 2025-05-07T20:33:01.1217846Z compiled=False, 2025-05-07T20:33:01.1218044Z ) 2025-05-07T20:33:01.1218356Z self = 2025-05-07T20:33:01.1218845Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1219116Z 2025-05-07T20:33:01.1219189Z @given( 2025-05-07T20:33:01.1219414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1219719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1220023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1220337Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1220657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1220933Z ) 2025-05-07T20:33:01.1221265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1221702Z def test_silu_mul_quant( 2025-05-07T20:33:01.1221943Z self, 2025-05-07T20:33:01.1222129Z T: int, 2025-05-07T20:33:01.1222320Z D: int, 2025-05-07T20:33:01.1222533Z scale_ub: Optional[float], 2025-05-07T20:33:01.1222795Z contiguous: bool, 2025-05-07T20:33:01.1223028Z compiled: bool, 2025-05-07T20:33:01.1223324Z ) -> None: 2025-05-07T20:33:01.1223529Z torch.manual_seed(2025) 2025-05-07T20:33:01.1223762Z 2025-05-07T20:33:01.1224027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1224363Z 2025-05-07T20:33:01.1224542Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1224826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1225129Z x = x_sign * x_clamp 2025-05-07T20:33:01.1225354Z x0 = x[:, :D] 2025-05-07T20:33:01.1225560Z x1 = x[:, D:] 2025-05-07T20:33:01.1225762Z 2025-05-07T20:33:01.1225932Z if contiguous: 2025-05-07T20:33:01.1226152Z x0 = x0.contiguous() 2025-05-07T20:33:01.1226394Z x1 = x1.contiguous() 2025-05-07T20:33:01.1226617Z 2025-05-07T20:33:01.1226800Z if scale_ub is not None: 2025-05-07T20:33:01.1227062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1227384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1227761Z ) 2025-05-07T20:33:01.1227993Z else: 2025-05-07T20:33:01.1228188Z scale_ub_tensor = None 2025-05-07T20:33:01.1228431Z 2025-05-07T20:33:01.1228650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1228952Z op = silu_mul_quant 2025-05-07T20:33:01.1229194Z if compiled: 2025-05-07T20:33:01.1229437Z op = torch.compile(op) 2025-05-07T20:33:01.1229722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1229984Z 2025-05-07T20:33:01.1230170Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1230327Z 2025-05-07T20:33:01.1230427Z moe/activation_test.py:117: 2025-05-07T20:33:01.1230704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1230857Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1230950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1231468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1231571Z 
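The ref_fn bodies shown in the examples above spell out the math the fused kernel implements: SiLU(x0) * x1 in fp32, followed by rowwise FP8 quantization. The activation part in isolation, as a standalone eager sketch (function name hypothetical):

import torch

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 computed in fp32, exactly as the test's ref_fn does
    # before handing the result to triton_quantize_fp8_row.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32

torch.nn.functional.silu(x0_fp32) * x1_fp32 computes the same thing, since SiLU(x) = x * sigmoid(x).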
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1231935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1232161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1232496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1232586Z kernel = self.compile( 2025-05-07T20:33:01.1232990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1233162Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1233289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1233300Z 2025-05-07T20:33:01.1233504Z self = 2025-05-07T20:33:01.1234272Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1234807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec77f60>} 2025-05-07T20:33:01.1235540Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1235735Z context = 2025-05-07T20:33:01.1235739Z 2025-05-07T20:33:01.1235998Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1236262Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1236374Z module_map=module_map) 2025-05-07T20:33:01.1236530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1236631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1236704Z E ^ 2025-05-07T20:33:01.1237052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1237057Z 2025-05-07T20:33:01.1237487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1237492Z 2025-05-07T20:33:01.1237590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1237819Z self=, 2025-05-07T20:33:01.1237889Z T=4096, 2025-05-07T20:33:01.1237963Z D=7168, 2025-05-07T20:33:01.1238052Z scale_ub=None, 2025-05-07T20:33:01.1238203Z contiguous=False, 2025-05-07T20:33:01.1238278Z compiled=False, 2025-05-07T20:33:01.1238356Z ) 2025-05-07T20:33:01.1238576Z self = 2025-05-07T20:33:01.1238752Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1238756Z 2025-05-07T20:33:01.1238836Z @given( 2025-05-07T20:33:01.1238948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1239043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1239163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1239275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1239395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1239511Z ) 2025-05-07T20:33:01.1239791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1239932Z def test_silu_mul_quant( 2025-05-07T20:33:01.1240038Z self, 2025-05-07T20:33:01.1240358Z T: int, 2025-05-07T20:33:01.1240478Z D: int, 2025-05-07T20:33:01.1240629Z scale_ub: Optional[float], 2025-05-07T20:33:01.1240755Z contiguous: bool, 2025-05-07T20:33:01.1240899Z compiled: bool, 2025-05-07T20:33:01.1241010Z ) -> None: 2025-05-07T20:33:01.1241135Z torch.manual_seed(2025) 2025-05-07T20:33:01.1241232Z 2025-05-07T20:33:01.1241401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1241482Z 2025-05-07T20:33:01.1241570Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1241690Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1241779Z x = x_sign * x_clamp 2025-05-07T20:33:01.1241858Z x0 = x[:, :D] 2025-05-07T20:33:01.1241935Z x1 = x[:, D:] 2025-05-07T20:33:01.1242009Z 2025-05-07T20:33:01.1242087Z if contiguous: 2025-05-07T20:33:01.1242178Z x0 = x0.contiguous() 2025-05-07T20:33:01.1242272Z x1 = x1.contiguous() 2025-05-07T20:33:01.1242340Z 2025-05-07T20:33:01.1242424Z if scale_ub is not None: 2025-05-07T20:33:01.1242530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1242662Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1242739Z ) 2025-05-07T20:33:01.1242813Z else: 2025-05-07T20:33:01.1242902Z scale_ub_tensor = None 2025-05-07T20:33:01.1242974Z 2025-05-07T20:33:01.1243101Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1243191Z op = silu_mul_quant 2025-05-07T20:33:01.1243278Z if compiled: 2025-05-07T20:33:01.1243372Z op = torch.compile(op) 2025-05-07T20:33:01.1243476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1243550Z 2025-05-07T20:33:01.1243635Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1243640Z 2025-05-07T20:33:01.1243921Z moe/activation_test.py:117: 2025-05-07T20:33:01.1244058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1244155Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1244256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1244761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1244853Z 
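For context, rowwise FP8 quantization of the kind triton_quantize_fp8_row performs derives one scale per row from that row's absolute max, so that y ~= y_fp8.to(torch.float32) * y_scale[:, None], which is exactly the check the test applies. A hedged eager sketch (not FBGEMM's kernel; scale_ub is treated here as an upper bound on the per-row max, which is how the test's 1200.0 appears to be used):

import torch
from typing import Optional, Tuple

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale per row
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale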
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1245230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1245453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1245801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1245893Z kernel = self.compile( 2025-05-07T20:33:01.1246294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1246538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1246662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1246666Z 2025-05-07T20:33:01.1246866Z self = 2025-05-07T20:33:01.1247639Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1248139Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76ca0>} 2025-05-07T20:33:01.1248949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1249138Z context = 2025-05-07T20:33:01.1249143Z 2025-05-07T20:33:01.1249309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1249573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1249677Z module_map=module_map) 2025-05-07T20:33:01.1249841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1249936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1250008Z E ^ 2025-05-07T20:33:01.1250372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1250380Z 2025-05-07T20:33:01.1250814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1250825Z 2025-05-07T20:33:01.1250968Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1251267Z self=, 2025-05-07T20:33:01.1251368Z T=128, 2025-05-07T20:33:01.1251472Z D=7168, 2025-05-07T20:33:01.1251575Z scale_ub=None, 2025-05-07T20:33:01.1251682Z contiguous=False, 2025-05-07T20:33:01.1251796Z compiled=True, 2025-05-07T20:33:01.1251889Z ) 2025-05-07T20:33:01.1252968Z self = 2025-05-07T20:33:01.1253138Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.1253142Z 2025-05-07T20:33:01.1253218Z @given( 2025-05-07T20:33:01.1253338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1253440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1253654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1253775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1253888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1253967Z ) 2025-05-07T20:33:01.1254215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1254304Z def test_silu_mul_quant( 2025-05-07T20:33:01.1254383Z self, 2025-05-07T20:33:01.1254458Z T: int, 2025-05-07T20:33:01.1254530Z D: int, 2025-05-07T20:33:01.1254631Z scale_ub: Optional[float], 2025-05-07T20:33:01.1254716Z contiguous: bool, 2025-05-07T20:33:01.1254798Z compiled: bool, 2025-05-07T20:33:01.1254880Z ) -> None: 2025-05-07T20:33:01.1254969Z torch.manual_seed(2025) 2025-05-07T20:33:01.1255040Z 2025-05-07T20:33:01.1255218Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1255288Z 2025-05-07T20:33:01.1255378Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1255510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1255643Z x = x_sign * x_clamp 2025-05-07T20:33:01.1255725Z x0 = x[:, :D] 2025-05-07T20:33:01.1255800Z x1 = x[:, D:] 2025-05-07T20:33:01.1255867Z 2025-05-07T20:33:01.1255951Z if contiguous: 2025-05-07T20:33:01.1256038Z x0 = x0.contiguous() 2025-05-07T20:33:01.1256122Z x1 = x1.contiguous() 2025-05-07T20:33:01.1256197Z 2025-05-07T20:33:01.1256282Z if scale_ub is not None: 2025-05-07T20:33:01.1256382Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1256520Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1256595Z ) 2025-05-07T20:33:01.1256665Z else: 2025-05-07T20:33:01.1256761Z scale_ub_tensor = None 2025-05-07T20:33:01.1256876Z 2025-05-07T20:33:01.1257007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1257099Z op = silu_mul_quant 2025-05-07T20:33:01.1257182Z if compiled: 2025-05-07T20:33:01.1257286Z op = torch.compile(op) 2025-05-07T20:33:01.1257389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1257457Z 2025-05-07T20:33:01.1257549Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1257666Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1257737Z 2025-05-07T20:33:01.1257874Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1257970Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1258065Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1264953Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1265121Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1265208Z 2025-05-07T20:33:01.1265308Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1265314Z 2025-05-07T20:33:01.1265414Z moe/activation_test.py:126: 2025-05-07T20:33:01.1265555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1265660Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1265792Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1266372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1266471Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1266843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1267060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1267544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1267920Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1268315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1268480Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1268824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1268898Z fn() 2025-05-07T20:33:01.1269320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1269400Z self.fn.run( 2025-05-07T20:33:01.1269733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1269830Z kernel = self.compile( 2025-05-07T20:33:01.1270212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1270394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1270562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1270567Z 2025-05-07T20:33:01.1270767Z self = 2025-05-07T20:33:01.1271573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1272091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae368180>} 2025-05-07T20:33:01.1272824Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1273057Z context = 2025-05-07T20:33:01.1273064Z 2025-05-07T20:33:01.1273225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1273487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1273594Z module_map=module_map) 2025-05-07T20:33:01.1273757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1273854Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1273928Z E ^ 2025-05-07T20:33:01.1274292Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1274297Z 2025-05-07T20:33:01.1274710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1274718Z 2025-05-07T20:33:01.1274821Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1275043Z self=, 2025-05-07T20:33:01.1275119Z T=128, 2025-05-07T20:33:01.1275194Z D=7168, 2025-05-07T20:33:01.1275270Z scale_ub=None, 2025-05-07T20:33:01.1275357Z contiguous=False, 2025-05-07T20:33:01.1275443Z compiled=False, 2025-05-07T20:33:01.1275511Z ) 2025-05-07T20:33:01.1275725Z self = 2025-05-07T20:33:01.1275897Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1275901Z 2025-05-07T20:33:01.1275976Z @given( 2025-05-07T20:33:01.1276089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1276193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1276303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1276422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1276632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1276706Z ) 2025-05-07T20:33:01.1276957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1277046Z def test_silu_mul_quant( 2025-05-07T20:33:01.1277120Z self, 2025-05-07T20:33:01.1277199Z T: int, 2025-05-07T20:33:01.1277270Z D: int, 2025-05-07T20:33:01.1277361Z scale_ub: Optional[float], 2025-05-07T20:33:01.1277451Z contiguous: bool, 2025-05-07T20:33:01.1277532Z compiled: bool, 2025-05-07T20:33:01.1277612Z ) -> None: 2025-05-07T20:33:01.1277701Z torch.manual_seed(2025) 2025-05-07T20:33:01.1277769Z 2025-05-07T20:33:01.1277938Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1278007Z 2025-05-07T20:33:01.1278093Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1278220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1278303Z x = x_sign * x_clamp 2025-05-07T20:33:01.1278385Z x0 = x[:, :D] 2025-05-07T20:33:01.1278507Z x1 = x[:, D:] 2025-05-07T20:33:01.1278577Z 2025-05-07T20:33:01.1278656Z if contiguous: 2025-05-07T20:33:01.1278750Z x0 = x0.contiguous() 2025-05-07T20:33:01.1278832Z x1 = x1.contiguous() 2025-05-07T20:33:01.1278899Z 2025-05-07T20:33:01.1278992Z if scale_ub is not None: 2025-05-07T20:33:01.1279095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1279232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1279306Z ) 2025-05-07T20:33:01.1279383Z else: 2025-05-07T20:33:01.1279479Z scale_ub_tensor = None 2025-05-07T20:33:01.1279549Z 2025-05-07T20:33:01.1279675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1279806Z op = silu_mul_quant 2025-05-07T20:33:01.1279888Z if compiled: 2025-05-07T20:33:01.1279985Z op = torch.compile(op) 2025-05-07T20:33:01.1280100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1280174Z 2025-05-07T20:33:01.1280261Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1280274Z 2025-05-07T20:33:01.1280368Z moe/activation_test.py:117: 2025-05-07T20:33:01.1280493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1280597Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1280692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1281232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1281369Z 
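Hypothesis draws each (T, D, scale_ub, contiguous, compiled) example from the sampled_from strategies in the @given decorator, capped at _MAX_SAMPLES; the failures cover both layouts and both compile modes, so the error is architecture-dependent, not data-dependent. The whole strategy space is a small cartesian product, enumerable with a plain loop (illustration only):

from itertools import product

# 5 * 2 * 2 * 2 * 2 = 80 combinations in total.
GRID = list(product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.0],               # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))
for T, D, scale_ub, contiguous, compiled in GRID:
    print(T, D, scale_ub, contiguous, compiled)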
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1281851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1282199Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1282702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1282799Z kernel = self.compile( 2025-05-07T20:33:01.1283205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1283374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1283498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1283503Z 2025-05-07T20:33:01.1283712Z self = 2025-05-07T20:33:01.1284476Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1285105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae36b100>} 2025-05-07T20:33:01.1285839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1286028Z context = 2025-05-07T20:33:01.1286033Z 2025-05-07T20:33:01.1286191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1286447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1286558Z module_map=module_map) 2025-05-07T20:33:01.1286715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1286811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1286886Z E ^ 2025-05-07T20:33:01.1287238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1287283Z 2025-05-07T20:33:01.1287704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1287709Z 2025-05-07T20:33:01.1287805Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1288024Z self=, 2025-05-07T20:33:01.1288105Z T=4096, 2025-05-07T20:33:01.1288181Z D=5120, 2025-05-07T20:33:01.1288262Z scale_ub=1200.0, 2025-05-07T20:33:01.1288346Z contiguous=True, 2025-05-07T20:33:01.1288426Z compiled=False, 2025-05-07T20:33:01.1288502Z ) 2025-05-07T20:33:01.1288720Z self = 2025-05-07T20:33:01.1288934Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1288939Z 2025-05-07T20:33:01.1289017Z @given( 2025-05-07T20:33:01.1289136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1289232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1289349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1289460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1289567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1289643Z ) 2025-05-07T20:33:01.1289883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1289976Z def test_silu_mul_quant( 2025-05-07T20:33:01.1290047Z self, 2025-05-07T20:33:01.1290119Z T: int, 2025-05-07T20:33:01.1290195Z D: int, 2025-05-07T20:33:01.1290288Z scale_ub: Optional[float], 2025-05-07T20:33:01.1290374Z contiguous: bool, 2025-05-07T20:33:01.1290467Z compiled: bool, 2025-05-07T20:33:01.1290544Z ) -> None: 2025-05-07T20:33:01.1290630Z torch.manual_seed(2025) 2025-05-07T20:33:01.1290705Z 2025-05-07T20:33:01.1290874Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1290944Z 2025-05-07T20:33:01.1291039Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1291158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1291248Z x = x_sign * x_clamp 2025-05-07T20:33:01.1291322Z x0 = x[:, :D] 2025-05-07T20:33:01.1291396Z x1 = x[:, D:] 2025-05-07T20:33:01.1291466Z 2025-05-07T20:33:01.1291544Z if contiguous: 2025-05-07T20:33:01.1291628Z x0 = x0.contiguous() 2025-05-07T20:33:01.1291716Z x1 = x1.contiguous() 2025-05-07T20:33:01.1291784Z 2025-05-07T20:33:01.1291866Z if scale_ub is not None: 2025-05-07T20:33:01.1291973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1292105Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1292179Z ) 2025-05-07T20:33:01.1292254Z else: 2025-05-07T20:33:01.1292420Z scale_ub_tensor = None 2025-05-07T20:33:01.1292494Z 2025-05-07T20:33:01.1292622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1292706Z op = silu_mul_quant 2025-05-07T20:33:01.1292794Z if compiled: 2025-05-07T20:33:01.1292889Z op = torch.compile(op) 2025-05-07T20:33:01.1292991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1293070Z 2025-05-07T20:33:01.1293155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1293159Z 2025-05-07T20:33:01.1293249Z moe/activation_test.py:117: 2025-05-07T20:33:01.1293379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1293476Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1293573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1294085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1294182Z 
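Two failure shapes alternate through this run: with compiled=False, fn() dies inside silu_mul_quant's own Triton launch (activation_test.py:117), while with compiled=True, fn() survives torch.compile's lowering and the error instead fires from the eager Triton reference in ref_fn (activation_test.py:126). The toggle itself is just (a sketch of the test's own wrapping):

import torch

def maybe_compiled(op, compiled: bool):
    # Same toggle as the test's fn(): optionally wrap op in torch.compile.
    # Wrapping changes *where* the fp8e4nv error surfaces, not whether.
    return torch.compile(op) if compiled else op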
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1294601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1294824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1295168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1295260Z kernel = self.compile( 2025-05-07T20:33:01.1295655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1295832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1295955Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1296003Z 2025-05-07T20:33:01.1296201Z self = 2025-05-07T20:33:01.1296976Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1297488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae1b1f80>} 2025-05-07T20:33:01.1298225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1298410Z context = 2025-05-07T20:33:01.1298415Z 2025-05-07T20:33:01.1298575Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1298846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1298954Z module_map=module_map) 2025-05-07T20:33:01.1299122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1299215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1299288Z E ^ 2025-05-07T20:33:01.1299658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1299663Z 2025-05-07T20:33:01.1300088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1300093Z 2025-05-07T20:33:01.1300198Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1300416Z self=, 2025-05-07T20:33:01.1300489Z T=1, 2025-05-07T20:33:01.1300573Z D=5120, 2025-05-07T20:33:01.1300654Z scale_ub=None, 2025-05-07T20:33:01.1300736Z contiguous=True, 2025-05-07T20:33:01.1300822Z compiled=True, 2025-05-07T20:33:01.1300969Z ) 2025-05-07T20:33:01.1301189Z self = 2025-05-07T20:33:01.1301351Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1301356Z 2025-05-07T20:33:01.1301429Z @given( 2025-05-07T20:33:01.1301550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1301643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1301755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1301873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1301981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1302048Z ) 2025-05-07T20:33:01.1302301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1302397Z def test_silu_mul_quant( 2025-05-07T20:33:01.1302482Z self, 2025-05-07T20:33:01.1302568Z T: int, 2025-05-07T20:33:01.1302661Z D: int, 2025-05-07T20:33:01.1302771Z scale_ub: Optional[float], 2025-05-07T20:33:01.1303492Z contiguous: bool, 2025-05-07T20:33:01.1303574Z compiled: bool, 2025-05-07T20:33:01.1303655Z ) -> None: 2025-05-07T20:33:01.1303744Z torch.manual_seed(2025) 2025-05-07T20:33:01.1303812Z 2025-05-07T20:33:01.1303979Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1304048Z 2025-05-07T20:33:01.1304134Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1304261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1304343Z x = x_sign * x_clamp 2025-05-07T20:33:01.1304416Z x0 = x[:, :D] 2025-05-07T20:33:01.1304499Z x1 = x[:, D:] 2025-05-07T20:33:01.1304565Z 2025-05-07T20:33:01.1304641Z if contiguous: 2025-05-07T20:33:01.1304801Z x0 = x0.contiguous() 2025-05-07T20:33:01.1304883Z x1 = x1.contiguous() 2025-05-07T20:33:01.1304956Z 2025-05-07T20:33:01.1305046Z if scale_ub is not None: 2025-05-07T20:33:01.1305149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1305289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1305363Z ) 2025-05-07T20:33:01.1305433Z else: 2025-05-07T20:33:01.1305527Z scale_ub_tensor = None 2025-05-07T20:33:01.1305596Z 2025-05-07T20:33:01.1305723Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1305811Z op = silu_mul_quant 2025-05-07T20:33:01.1305891Z if compiled: 2025-05-07T20:33:01.1305984Z op = torch.compile(op) 2025-05-07T20:33:01.1306090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1306157Z 2025-05-07T20:33:01.1306250Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1306369Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1306438Z 2025-05-07T20:33:01.1306575Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1306672Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1306768Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1306893Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1307027Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1307093Z 2025-05-07T20:33:01.1307192Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1307197Z 2025-05-07T20:33:01.1307289Z moe/activation_test.py:126: 2025-05-07T20:33:01.1307494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1307594Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1307721Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1308290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1308471Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1308828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1309052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1309419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1309676Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1310052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1310216Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1310558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1310640Z fn() 2025-05-07T20:33:01.1311068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1311186Z self.fn.run( 2025-05-07T20:33:01.1311535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1311630Z kernel = self.compile( 2025-05-07T20:33:01.1312006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1312176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1312309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1312314Z 2025-05-07T20:33:01.1312513Z self = 2025-05-07T20:33:01.1313285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1313851Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae35e520>} 2025-05-07T20:33:01.1314589Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1314774Z context = 2025-05-07T20:33:01.1314778Z 2025-05-07T20:33:01.1314935Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1315200Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1315304Z module_map=module_map) 2025-05-07T20:33:01.1315462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1315570Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1315644Z E ^ 2025-05-07T20:33:01.1315999Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1316004Z 2025-05-07T20:33:01.1316428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1316433Z 2025-05-07T20:33:01.1316530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1316761Z self=, 2025-05-07T20:33:01.1316834Z T=2048, 2025-05-07T20:33:01.1316913Z D=5120, 2025-05-07T20:33:01.1316988Z scale_ub=None, 2025-05-07T20:33:01.1317069Z contiguous=True, 2025-05-07T20:33:01.1317152Z compiled=True, 2025-05-07T20:33:01.1317217Z ) 2025-05-07T20:33:01.1317433Z self = 2025-05-07T20:33:01.1317681Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1317688Z 2025-05-07T20:33:01.1317759Z @given( 2025-05-07T20:33:01.1317872Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1317971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1318081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1318191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1318305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1318373Z ) 2025-05-07T20:33:01.1318625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1318711Z def test_silu_mul_quant( 2025-05-07T20:33:01.1318780Z self, 2025-05-07T20:33:01.1318856Z T: int, 2025-05-07T20:33:01.1318931Z D: int, 2025-05-07T20:33:01.1319024Z scale_ub: Optional[float], 2025-05-07T20:33:01.1319117Z contiguous: bool, 2025-05-07T20:33:01.1319201Z compiled: bool, 2025-05-07T20:33:01.1319272Z ) -> None: 2025-05-07T20:33:01.1319407Z torch.manual_seed(2025) 2025-05-07T20:33:01.1319474Z 2025-05-07T20:33:01.1319635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1319710Z 2025-05-07T20:33:01.1319794Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1319920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1320000Z x = x_sign * x_clamp 2025-05-07T20:33:01.1320075Z x0 = x[:, :D] 2025-05-07T20:33:01.1320157Z x1 = x[:, D:] 2025-05-07T20:33:01.1320226Z 2025-05-07T20:33:01.1320304Z if contiguous: 2025-05-07T20:33:01.1320394Z x0 = x0.contiguous() 2025-05-07T20:33:01.1320481Z x1 = x1.contiguous() 2025-05-07T20:33:01.1320589Z 2025-05-07T20:33:01.1320679Z if scale_ub is not None: 2025-05-07T20:33:01.1320778Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1320913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1320998Z ) 2025-05-07T20:33:01.1321066Z else: 2025-05-07T20:33:01.1321161Z scale_ub_tensor = None 2025-05-07T20:33:01.1321231Z 2025-05-07T20:33:01.1321354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1321443Z op = silu_mul_quant 2025-05-07T20:33:01.1321523Z if compiled: 2025-05-07T20:33:01.1321616Z op = torch.compile(op) 2025-05-07T20:33:01.1321722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1321791Z 2025-05-07T20:33:01.1321877Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1321998Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1322067Z 2025-05-07T20:33:01.1322196Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1322301Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1322395Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1322521Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1322658Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1322725Z 2025-05-07T20:33:01.1322824Z > y_fp8_ref, 
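All of these launches enter through Triton's subscript syntax, kernel[grid](...), visible in the frames above as _fbgemm_silu_mul_quant[grid](...) and _kernel_quantize_fp8_row[grid](...): subscripting a @triton.jit function with a grid returns a launcher, and calling it triggers compile-on-first-use (the jit.py -> compiler.py path in each traceback). A self-contained sketch of the pattern with a trivial kernel (hypothetical, dtype-agnostic):

import torch
import triton
import triton.language as tl

@triton.jit
def _scale_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of x.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(y_ptr + offs, x * 2.0, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.empty_like(x)
n = x.numel()
grid = (triton.cdiv(n, 1024),)   # one program per 1024-element block
_scale_kernel[grid](x, y, n, BLOCK=1024)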
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1322829Z 2025-05-07T20:33:01.1322919Z moe/activation_test.py:126: 2025-05-07T20:33:01.1323041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1323147Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1323274Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1323839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1323939Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1324298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1324607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1324987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1325242Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1325618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1325779Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1326133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1326206Z fn() 2025-05-07T20:33:01.1326617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1326705Z self.fn.run( 2025-05-07T20:33:01.1327042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1327178Z kernel = self.compile( 2025-05-07T20:33:01.1327572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1327742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1327873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1327877Z 2025-05-07T20:33:01.1328077Z self = 2025-05-07T20:33:01.1328838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1329394Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae10a840>} 2025-05-07T20:33:01.1330129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1330324Z context = 2025-05-07T20:33:01.1330329Z 2025-05-07T20:33:01.1330486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1330758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1330861Z module_map=module_map) 2025-05-07T20:33:01.1331018Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1331124Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1331193Z E ^ 2025-05-07T20:33:01.1331546Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1331559Z 2025-05-07T20:33:01.1331982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1331987Z 2025-05-07T20:33:01.1332085Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1332307Z self=, 2025-05-07T20:33:01.1332381Z T=128, 2025-05-07T20:33:01.1332452Z D=5120, 2025-05-07T20:33:01.1332537Z scale_ub=None, 2025-05-07T20:33:01.1332616Z contiguous=True, 2025-05-07T20:33:01.1332690Z compiled=True, 2025-05-07T20:33:01.1332767Z ) 2025-05-07T20:33:01.1332989Z self = 2025-05-07T20:33:01.1333158Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1333163Z 2025-05-07T20:33:01.1333234Z @given( 2025-05-07T20:33:01.1333426Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1333529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1333638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1333750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1333864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1333933Z ) 2025-05-07T20:33:01.1334176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1334269Z def test_silu_mul_quant( 2025-05-07T20:33:01.1334340Z self, 2025-05-07T20:33:01.1334417Z T: int, 2025-05-07T20:33:01.1334488Z D: int, 2025-05-07T20:33:01.1334580Z scale_ub: Optional[float], 2025-05-07T20:33:01.1334672Z contiguous: bool, 2025-05-07T20:33:01.1334755Z compiled: bool, 2025-05-07T20:33:01.1334827Z ) -> None: 2025-05-07T20:33:01.1334922Z torch.manual_seed(2025) 2025-05-07T20:33:01.1334990Z 2025-05-07T20:33:01.1335159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1335302Z 2025-05-07T20:33:01.1335391Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1335511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1335598Z x = x_sign * x_clamp 2025-05-07T20:33:01.1335674Z x0 = x[:, :D] 2025-05-07T20:33:01.1335746Z x1 = x[:, D:] 2025-05-07T20:33:01.1335819Z 2025-05-07T20:33:01.1335896Z if contiguous: 2025-05-07T20:33:01.1335987Z x0 = x0.contiguous() 2025-05-07T20:33:01.1336070Z x1 = x1.contiguous() 2025-05-07T20:33:01.1336135Z 2025-05-07T20:33:01.1336222Z if scale_ub is not None: 2025-05-07T20:33:01.1336323Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1336495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1336568Z ) 2025-05-07T20:33:01.1336637Z else: 2025-05-07T20:33:01.1336732Z scale_ub_tensor = None 2025-05-07T20:33:01.1336807Z 2025-05-07T20:33:01.1336930Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1337015Z op = silu_mul_quant 2025-05-07T20:33:01.1337100Z if compiled: 2025-05-07T20:33:01.1337193Z op = torch.compile(op) 2025-05-07T20:33:01.1337304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1337372Z 2025-05-07T20:33:01.1337455Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1337577Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1337642Z 2025-05-07T20:33:01.1337773Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1337874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1337967Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1338085Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1338229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1338295Z 2025-05-07T20:33:01.1338399Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1338403Z 2025-05-07T20:33:01.1338496Z moe/activation_test.py:126: 2025-05-07T20:33:01.1338619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1338722Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1338851Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1339422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1339522Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1339891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1340387Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1341063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1341327Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1341713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1341877Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1342222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1342298Z fn() 2025-05-07T20:33:01.1342716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1342802Z self.fn.run( 2025-05-07T20:33:01.1343138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1343225Z kernel = self.compile( 2025-05-07T20:33:01.1343630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1343862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1343991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1343996Z 2025-05-07T20:33:01.1344197Z self = 2025-05-07T20:33:01.1344958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1345473Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b43aa160>} 2025-05-07T20:33:01.1346272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1346467Z context = 2025-05-07T20:33:01.1346472Z 2025-05-07T20:33:01.1346630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1346894Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1347004Z module_map=module_map) 2025-05-07T20:33:01.1347161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1347267Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1347339Z E ^ 2025-05-07T20:33:01.1347776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Two duplicate Hypothesis traces elided. Both examples failed identically in ref_fn() at moe/activation_test.py:126, while compiling _kernel_quantize_fp8_row (reached via triton_quantize_fp8_row, fp8_gemm.py:2370):
  Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
Each raised triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100.]
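[Note: fp8e4nv is Triton's name for the float8_e4m3fn format, which Triton compiles only for NVIDIA GPUs of compute capability 8.9 (Ada) or newer; on older architectures it exposes only fp8e4b15 and fp8e5, which matches the error above. This job runs on a g5.4xlarge runner whose A10G is SM 8.6, so every fp8e4nv kernel compile in this test is expected to fail. A minimal sketch of the check, using only public torch APIs:]

    import torch

    # A10G (g5.4xlarge) reports (8, 6); Triton's fp8e4nv needs >= (8, 9).
    major, minor = torch.cuda.get_device_capability()
    print(f"sm_{major}{minor} supports fp8e4nv: {(major, minor) >= (8, 9)}")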
[Two more duplicate traces elided, failing with the same CompilationError:
  Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> failed in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant (reached via silu_mul_quant, gen_ai/moe/activation.py:80)
  Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> failed in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Both raised ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
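[Note: one way to keep these tests green on pre-SM89 runners is to gate them on the capability check above. The decorator below is a hypothetical sketch, not FBGEMM's actual guard; unittest.skipUnless and torch.cuda.get_device_capability are real APIs, while the helper and decorator names are assumptions:]

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ GPUs.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests such as test_silu_mul_quant.
    skip_if_no_fp8e4nv = unittest.skipUnless(
        _cuda_supports_fp8e4nv(),
        "Triton fp8e4nv requires SM 8.9+ (this runner's A10G is SM 8.6)",
    )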
[Two more duplicate traces elided, both failing in fn() at moe/activation_test.py:117 while compiling _fbgemm_silu_mul_quant; note the first example has compiled=False, so the failure is independent of torch.compile:
  Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
Both raised ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
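[Note: a minimal repro sketch of the two failing entry points, assuming an fbgemm_gpu build with the experimental GenAI ops installed (import paths taken from the tracebacks above); on an SM 8.6 GPU both Triton kernels fail at their first launch with this CompilationError:]

    import torch

    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn_like(x0)

    # fn() path: compiles _fbgemm_silu_mul_quant (activation.py:80).
    silu_mul_quant(x0, x1, None)

    # ref_fn() path: compiles _kernel_quantize_fp8_row (fp8_gemm.py:2370).
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    triton_quantize_fp8_row(y, None)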
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1448245Z 2025-05-07T20:33:01.1448662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1448669Z 2025-05-07T20:33:01.1448774Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1448997Z self=, 2025-05-07T20:33:01.1449075Z T=128, 2025-05-07T20:33:01.1449157Z D=7168, 2025-05-07T20:33:01.1449235Z scale_ub=1200.0, 2025-05-07T20:33:01.1449329Z contiguous=False, 2025-05-07T20:33:01.1449410Z compiled=False, 2025-05-07T20:33:01.1449480Z ) 2025-05-07T20:33:01.1449702Z self = 2025-05-07T20:33:01.1449873Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1449877Z 2025-05-07T20:33:01.1449951Z @given( 2025-05-07T20:33:01.1450073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1450171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1450280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1450405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1450517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1450596Z ) 2025-05-07T20:33:01.1450835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1450926Z def test_silu_mul_quant( 2025-05-07T20:33:01.1451003Z self, 2025-05-07T20:33:01.1451085Z T: int, 2025-05-07T20:33:01.1451160Z D: int, 2025-05-07T20:33:01.1451263Z scale_ub: Optional[float], 2025-05-07T20:33:01.1451348Z contiguous: bool, 2025-05-07T20:33:01.1451431Z compiled: bool, 2025-05-07T20:33:01.1451516Z ) -> None: 2025-05-07T20:33:01.1451608Z torch.manual_seed(2025) 2025-05-07T20:33:01.1451675Z 2025-05-07T20:33:01.1451845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1451913Z 2025-05-07T20:33:01.1452005Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1452208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1452297Z x = x_sign * x_clamp 2025-05-07T20:33:01.1452376Z x0 = x[:, :D] 2025-05-07T20:33:01.1452454Z x1 = x[:, D:] 2025-05-07T20:33:01.1452518Z 2025-05-07T20:33:01.1452600Z if contiguous: 2025-05-07T20:33:01.1452686Z x0 = x0.contiguous() 2025-05-07T20:33:01.1452770Z x1 = x1.contiguous() 2025-05-07T20:33:01.1452847Z 2025-05-07T20:33:01.1452930Z if scale_ub is not None: 2025-05-07T20:33:01.1453029Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1453164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1453231Z ) 2025-05-07T20:33:01.1453307Z else: 2025-05-07T20:33:01.1453395Z scale_ub_tensor = None 2025-05-07T20:33:01.1453462Z 2025-05-07T20:33:01.1453593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1453678Z op = silu_mul_quant 2025-05-07T20:33:01.1453763Z if compiled: 2025-05-07T20:33:01.1453906Z op = torch.compile(op) 2025-05-07T20:33:01.1454006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1454072Z 2025-05-07T20:33:01.1454163Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1454168Z 2025-05-07T20:33:01.1454258Z moe/activation_test.py:117: 2025-05-07T20:33:01.1454381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1454491Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1454585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1455086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1455180Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1455576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1455806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1456144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1456238Z kernel = self.compile( 2025-05-07T20:33:01.1456688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1456919Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1457088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1457098Z 2025-05-07T20:33:01.1457353Z self = 2025-05-07T20:33:01.1458280Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1458786Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3789c0ec00>} 2025-05-07T20:33:01.1459516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1459705Z context = 2025-05-07T20:33:01.1459709Z 2025-05-07T20:33:01.1459866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1460124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1460228Z module_map=module_map) 2025-05-07T20:33:01.1460384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1460587Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1460661Z E ^ 2025-05-07T20:33:01.1461011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1461016Z 2025-05-07T20:33:01.1461430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1461435Z 2025-05-07T20:33:01.1461532Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1461753Z self=, 2025-05-07T20:33:01.1461825Z T=128, 2025-05-07T20:33:01.1461900Z D=5120, 2025-05-07T20:33:01.1461981Z scale_ub=None, 2025-05-07T20:33:01.1462062Z contiguous=False, 2025-05-07T20:33:01.1462141Z compiled=False, 2025-05-07T20:33:01.1462219Z ) 2025-05-07T20:33:01.1462431Z self = 2025-05-07T20:33:01.1462607Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1462652Z 2025-05-07T20:33:01.1462724Z @given( 2025-05-07T20:33:01.1462841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1462941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1463058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1463170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1463283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1463349Z ) 2025-05-07T20:33:01.1463584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1463675Z def test_silu_mul_quant( 2025-05-07T20:33:01.1463748Z self, 2025-05-07T20:33:01.1463824Z T: int, 2025-05-07T20:33:01.1463897Z D: int, 2025-05-07T20:33:01.1464052Z scale_ub: Optional[float], 2025-05-07T20:33:01.1464145Z contiguous: bool, 2025-05-07T20:33:01.1464226Z compiled: bool, 2025-05-07T20:33:01.1464304Z ) -> None: 2025-05-07T20:33:01.1464406Z torch.manual_seed(2025) 2025-05-07T20:33:01.1464477Z 2025-05-07T20:33:01.1464639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1464713Z 2025-05-07T20:33:01.1464802Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1464921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1465012Z x = x_sign * x_clamp 2025-05-07T20:33:01.1465090Z x0 = x[:, :D] 2025-05-07T20:33:01.1465165Z x1 = x[:, D:] 2025-05-07T20:33:01.1465242Z 2025-05-07T20:33:01.1465322Z if contiguous: 2025-05-07T20:33:01.1465415Z x0 = x0.contiguous() 2025-05-07T20:33:01.1465500Z x1 = x1.contiguous() 2025-05-07T20:33:01.1465569Z 2025-05-07T20:33:01.1465661Z if scale_ub is not None: 2025-05-07T20:33:01.1465765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1465901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1465982Z ) 2025-05-07T20:33:01.1466062Z else: 2025-05-07T20:33:01.1466156Z scale_ub_tensor = None 2025-05-07T20:33:01.1466232Z 2025-05-07T20:33:01.1466357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1466442Z op = silu_mul_quant 2025-05-07T20:33:01.1466529Z if compiled: 2025-05-07T20:33:01.1466624Z op = torch.compile(op) 2025-05-07T20:33:01.1466732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1466802Z 2025-05-07T20:33:01.1466886Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1466891Z 2025-05-07T20:33:01.1466989Z moe/activation_test.py:117: 2025-05-07T20:33:01.1467112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1467210Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1467313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1467965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1468067Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1468418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1468632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1468975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1469062Z kernel = self.compile( 2025-05-07T20:33:01.1469457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1469631Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1469755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1469760Z 2025-05-07T20:33:01.1469968Z self = 2025-05-07T20:33:01.1470797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1471287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788e25e40>} 2025-05-07T20:33:01.1472024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1472208Z context = 2025-05-07T20:33:01.1472255Z 2025-05-07T20:33:01.1472422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1472680Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1472786Z module_map=module_map) 2025-05-07T20:33:01.1472950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1473042Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1473120Z E ^ 2025-05-07T20:33:01.1473466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1473471Z 2025-05-07T20:33:01.1473873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1473878Z 2025-05-07T20:33:01.1473983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1474199Z self=, 2025-05-07T20:33:01.1474282Z T=128, 2025-05-07T20:33:01.1474355Z D=5120, 2025-05-07T20:33:01.1474438Z scale_ub=1200.0, 2025-05-07T20:33:01.1474527Z contiguous=True, 2025-05-07T20:33:01.1474605Z compiled=False, 2025-05-07T20:33:01.1474674Z ) 2025-05-07T20:33:01.1474896Z self = 2025-05-07T20:33:01.1475061Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1475065Z 2025-05-07T20:33:01.1475139Z @given( 2025-05-07T20:33:01.1475262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1475358Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1475473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1475585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1475693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1475772Z ) 2025-05-07T20:33:01.1476010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1476178Z def test_silu_mul_quant( 2025-05-07T20:33:01.1476257Z self, 2025-05-07T20:33:01.1476336Z T: int, 2025-05-07T20:33:01.1476410Z D: int, 2025-05-07T20:33:01.1476509Z scale_ub: Optional[float], 2025-05-07T20:33:01.1476596Z contiguous: bool, 2025-05-07T20:33:01.1476675Z compiled: bool, 2025-05-07T20:33:01.1476755Z ) -> None: 2025-05-07T20:33:01.1476847Z torch.manual_seed(2025) 2025-05-07T20:33:01.1476921Z 2025-05-07T20:33:01.1477085Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1477155Z 2025-05-07T20:33:01.1477250Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1477369Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1477454Z x = x_sign * x_clamp 2025-05-07T20:33:01.1477537Z x0 = x[:, :D] 2025-05-07T20:33:01.1477617Z x1 = x[:, D:] 2025-05-07T20:33:01.1477687Z 2025-05-07T20:33:01.1477774Z if contiguous: 2025-05-07T20:33:01.1477867Z x0 = x0.contiguous() 2025-05-07T20:33:01.1477996Z x1 = x1.contiguous() 2025-05-07T20:33:01.1478070Z 2025-05-07T20:33:01.1478156Z if scale_ub is not None: 2025-05-07T20:33:01.1478257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1478394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1478466Z ) 2025-05-07T20:33:01.1478544Z else: 2025-05-07T20:33:01.1478633Z scale_ub_tensor = None 2025-05-07T20:33:01.1478704Z 2025-05-07T20:33:01.1478835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1478921Z op = silu_mul_quant 2025-05-07T20:33:01.1479001Z if compiled: 2025-05-07T20:33:01.1479103Z op = torch.compile(op) 2025-05-07T20:33:01.1479207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1479320Z 2025-05-07T20:33:01.1479414Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1479418Z 2025-05-07T20:33:01.1479521Z moe/activation_test.py:117: 2025-05-07T20:33:01.1479656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1479750Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1479847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1480344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1480437Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:01.1480793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1481019Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1481354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1481455Z     kernel = self.compile(
2025-05-07T20:33:01.1481855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1482028Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1482156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1482160Z 
2025-05-07T20:33:01.1482357Z self = <triton.compiler.compiler.ASTSource object at 0x…>
2025-05-07T20:33:01.1483121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.1483613Z codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f37ae84a0c0>}
2025-05-07T20:33:01.1484426Z module_map = {'triton.language.extra.libdevice': <module …>}
2025-05-07T20:33:01.1484622Z context = <…>
2025-05-07T20:33:01.1484627Z 
2025-05-07T20:33:01.1484785Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.1485044Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.1485148Z                            module_map=module_map)
2025-05-07T20:33:01.1485308Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1485406Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:01.1485481Z E   ^
2025-05-07T20:33:01.1485825Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1485839Z 
2025-05-07T20:33:01.1486247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.1486296Z 
2025-05-07T20:33:01.1486395Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.1486617Z     self=<…>,
2025-05-07T20:33:01.1486690Z     T=1,
2025-05-07T20:33:01.1486764Z     D=7168,
2025-05-07T20:33:01.1486850Z     scale_ub=1200.0,
2025-05-07T20:33:01.1486929Z     contiguous=True,
2025-05-07T20:33:01.1487008Z     compiled=True,
2025-05-07T20:33:01.1487084Z )
2025-05-07T20:33:01.1487296Z self = <…>
2025-05-07T20:33:01.1487463Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:01.1487467Z 
2025-05-07T20:33:01.1487541Z     @given(
2025-05-07T20:33:01.1487654Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:01.1487799Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:01.1487910Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:01.1488027Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:01.1488143Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:01.1488215Z     )
2025-05-07T20:33:01.1488462Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:01.1488550Z     def test_silu_mul_quant(
2025-05-07T20:33:01.1488623Z         self,
2025-05-07T20:33:01.1488702Z         T: int,
2025-05-07T20:33:01.1488775Z         D: int,
2025-05-07T20:33:01.1488869Z         scale_ub: Optional[float],
2025-05-07T20:33:01.1488958Z         contiguous: bool,
2025-05-07T20:33:01.1489039Z         compiled: bool,
2025-05-07T20:33:01.1489114Z     ) -> None:
2025-05-07T20:33:01.1489212Z         torch.manual_seed(2025)
2025-05-07T20:33:01.1489279Z 
2025-05-07T20:33:01.1489445Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:01.1489519Z 
2025-05-07T20:33:01.1489607Z         x_sign = torch.sign(x)
2025-05-07T20:33:01.1489732Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:01.1489825Z         x = x_sign * x_clamp
2025-05-07T20:33:01.1489900Z         x0 = x[:, :D]
2025-05-07T20:33:01.1489984Z         x1 = x[:, D:]
2025-05-07T20:33:01.1490055Z 
2025-05-07T20:33:01.1490134Z         if contiguous:
2025-05-07T20:33:01.1490226Z             x0 = x0.contiguous()
2025-05-07T20:33:01.1490310Z             x1 = x1.contiguous()
2025-05-07T20:33:01.1490379Z 
2025-05-07T20:33:01.1490470Z         if scale_ub is not None:
2025-05-07T20:33:01.1490572Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:01.1490703Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:01.1490785Z             )
2025-05-07T20:33:01.1490860Z         else:
2025-05-07T20:33:01.1490955Z             scale_ub_tensor = None
2025-05-07T20:33:01.1491033Z 
2025-05-07T20:33:01.1491159Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.1491253Z             op = silu_mul_quant
2025-05-07T20:33:01.1491418Z             if compiled:
2025-05-07T20:33:01.1491517Z                 op = torch.compile(op)
2025-05-07T20:33:01.1491627Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1491697Z 
2025-05-07T20:33:01.1491783Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:01.1491787Z 
2025-05-07T20:33:01.1491886Z moe/activation_test.py:117: 
2025-05-07T20:33:01.1492010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1492105Z moe/activation_test.py:115: in fn
2025-05-07T20:33:01.1492210Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1492576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:01.1492675Z     return fn(*args, **kwargs)
2025-05-07T20:33:01.1493168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:01.1493265Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:01.1493665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1493882Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1494219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1494321Z     kernel = self.compile(
2025-05-07T20:33:01.1494702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1494877Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1495001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1495045Z 
2025-05-07T20:33:01.1495242Z self = <triton.compiler.compiler.ASTSource object at 0x…>
2025-05-07T20:33:01.1496015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.1496509Z codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f37ae35f1a0>}
2025-05-07T20:33:01.1497248Z module_map = {'triton.language.extra.libdevice': <module …>}
2025-05-07T20:33:01.1497432Z context = <…>
2025-05-07T20:33:01.1497437Z 
2025-05-07T20:33:01.1497604Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.1497862Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.1497972Z                            module_map=module_map)
2025-05-07T20:33:01.1498139Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1498235Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:01.1498305Z E   ^
2025-05-07T20:33:01.1498657Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1498663Z 
2025-05-07T20:33:01.1499073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
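[Note on the failure mode: fp8e4nv is Triton's name for the CUDA e4m3 float8 type, which Triton only lowers on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose NVIDIA A10G reports sm_86, so every Triton kernel that touches fp8e4nv fails at compile time regardless of T, D, scale_ub, contiguous, or compiled. A minimal guard sketch, assuming only the standard torch API; the helper name supports_fp8e4nv is hypothetical:]

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (e4m3) only on compute capability >= (8, 9),
        # i.e. Ada/Hopper; the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical use on the failing test, e.g. with unittest.skipIf:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...) -> None: ...

[Guarding this way would turn the repeated CompilationError below into a clean skip on sm_86 runners while leaving sm_89+ coverage intact.]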
2025-05-07T20:33:01.1499185Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError in _fbgemm_silu_mul_quant [identical traceback and test source elided]
2025-05-07T20:33:01.1512211Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.1512428Z     self=<…>,
2025-05-07T20:33:01.1512508Z     T=1,
2025-05-07T20:33:01.1512584Z     D=7168,
2025-05-07T20:33:01.1512665Z     scale_ub=None,
2025-05-07T20:33:01.1512751Z     contiguous=False,
2025-05-07T20:33:01.1512830Z     compiled=True,
2025-05-07T20:33:01.1512902Z )
2025-05-07T20:33:01.1513119Z self = <…>
2025-05-07T20:33:01.1513278Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test source as above; for this example fn() succeeds and the reference path fails instead:]
2025-05-07T20:33:01.1517633Z         y_fp8, y_scale = fn()
2025-05-07T20:33:01.1517749Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:01.1517826Z 
2025-05-07T20:33:01.1517960Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.1518058Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:01.1518158Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:01.1518276Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:01.1518413Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.1518537Z 
2025-05-07T20:33:01.1523336Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:01.1523347Z 
2025-05-07T20:33:01.1523474Z moe/activation_test.py:126: 
2025-05-07T20:33:01.1523607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1523712Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:01.1523852Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.1524406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:01.1524503Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:01.1524864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1525080Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1525470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:01.1525726Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:01.1526100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:01.1526268Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:01.1526609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:01.1526685Z     fn()
2025-05-07T20:33:01.1527089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:01.1527171Z     self.fn.run(
2025-05-07T20:33:01.1527511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1527603Z     kernel = self.compile(
2025-05-07T20:33:01.1527984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1528264Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1528394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[compiler locals as above, except this autotuner config uses num_stages=2]
2025-05-07T20:33:01.1531522Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1531621Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:01.1531706Z E   ^
2025-05-07T20:33:01.1532057Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1532062Z 
2025-05-07T20:33:01.1532516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
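[Note that this failure comes from the reference path, not the kernel under test: triton_quantize_fp8_row JIT-compiles its own Triton fp8e4nv kernel, so a capability guard has to cover both fn() and ref_fn(). For illustration only, rowwise fp8 quantization can be written in plain PyTorch with no Triton involved. This is a hypothetical stand-in; the exact scale_ub and epsilon handling in fbgemm's triton_quantize_fp8_row may differ:]

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dynamic scaling so that y ~= y_fp8.float() * scale[:, None].
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            # Assumed semantics: cap the row max before deriving the scale.
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

[Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test does, then recovers y up to e4m3 rounding.]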
[Every remaining example fails with the identical fp8e4nv CompilationError; duplicate test source and tracebacks elided, failing kernel noted per example:]
2025-05-07T20:33:01.1532633Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1546153Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1558789Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1571862Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1584599Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
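[For reference, the op these examples exercise is SiLU-gated multiplication followed by the rowwise quantization sketched above. Unfused, in plain PyTorch, mirroring ref_fn from the listing (function name hypothetical):]

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 computed in fp32, exactly as ref_fn does above.
        x0_fp32 = x0.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)

    # y_fp8, y_scale = quantize_fp8_row_torch(silu_mul(x0, x1), scale_ub_tensor)

[The fused _fbgemm_silu_mul_quant kernel computes the same thing in one pass; presumably only its fp8 output type trips the sm_86 compile error.]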
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1596930Z 2025-05-07T20:33:01.1597368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1597373Z 2025-05-07T20:33:01.1597477Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1597695Z self=, 2025-05-07T20:33:01.1597770Z T=4096, 2025-05-07T20:33:01.1597851Z D=7168, 2025-05-07T20:33:01.1597931Z scale_ub=1200.0, 2025-05-07T20:33:01.1598014Z contiguous=False, 2025-05-07T20:33:01.1598101Z compiled=False, 2025-05-07T20:33:01.1598170Z ) 2025-05-07T20:33:01.1598392Z self = 2025-05-07T20:33:01.1598566Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1598574Z 2025-05-07T20:33:01.1598649Z @given( 2025-05-07T20:33:01.1598778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1598881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1598992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1599111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1599219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1599292Z ) 2025-05-07T20:33:01.1599538Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1599630Z def test_silu_mul_quant( 2025-05-07T20:33:01.1599712Z self, 2025-05-07T20:33:01.1599787Z T: int, 2025-05-07T20:33:01.1599864Z D: int, 2025-05-07T20:33:01.1599963Z scale_ub: Optional[float], 2025-05-07T20:33:01.1600048Z contiguous: bool, 2025-05-07T20:33:01.1600132Z compiled: bool, 2025-05-07T20:33:01.1600221Z ) -> None: 2025-05-07T20:33:01.1600315Z torch.manual_seed(2025) 2025-05-07T20:33:01.1600388Z 2025-05-07T20:33:01.1600639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1600720Z 2025-05-07T20:33:01.1600808Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1600980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1601101Z x = x_sign * x_clamp 2025-05-07T20:33:01.1601216Z x0 = x[:, :D] 2025-05-07T20:33:01.1601327Z x1 = x[:, D:] 2025-05-07T20:33:01.1601413Z 2025-05-07T20:33:01.1601500Z if contiguous: 2025-05-07T20:33:01.1601586Z x0 = x0.contiguous() 2025-05-07T20:33:01.1601669Z x1 = x1.contiguous() 2025-05-07T20:33:01.1601749Z 2025-05-07T20:33:01.1601834Z if scale_ub is not None: 2025-05-07T20:33:01.1601935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1602081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1602160Z ) 2025-05-07T20:33:01.1602236Z else: 2025-05-07T20:33:01.1602360Z scale_ub_tensor = None 2025-05-07T20:33:01.1602466Z 2025-05-07T20:33:01.1602701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1602824Z op = silu_mul_quant 2025-05-07T20:33:01.1602934Z if compiled: 2025-05-07T20:33:01.1603051Z op = torch.compile(op) 2025-05-07T20:33:01.1603157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1603223Z 2025-05-07T20:33:01.1603317Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1603322Z 2025-05-07T20:33:01.1603415Z moe/activation_test.py:117: 2025-05-07T20:33:01.1603539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1603640Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1603735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1604279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.1604380Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1604734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1604962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1605300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1605389Z kernel = self.compile( 2025-05-07T20:33:01.1605790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1605963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1606090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1606098Z 2025-05-07T20:33:01.1606294Z self = 2025-05-07T20:33:01.1607064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1607563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788f9b880>} 2025-05-07T20:33:01.1608298Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1608485Z context = 2025-05-07T20:33:01.1608489Z 2025-05-07T20:33:01.1608647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1608904Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1609093Z module_map=module_map) 2025-05-07T20:33:01.1609255Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1609355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1609431Z E ^ 2025-05-07T20:33:01.1609778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1609782Z 2025-05-07T20:33:01.1610217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1610222Z 2025-05-07T20:33:01.1610321Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1610541Z self=, 2025-05-07T20:33:01.1610616Z T=16384, 2025-05-07T20:33:01.1610692Z D=7168, 2025-05-07T20:33:01.1610777Z scale_ub=None, 2025-05-07T20:33:01.1610860Z contiguous=True, 2025-05-07T20:33:01.1610940Z compiled=True, 2025-05-07T20:33:01.1611017Z ) 2025-05-07T20:33:01.1611229Z self = 2025-05-07T20:33:01.1611443Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1611447Z 2025-05-07T20:33:01.1611528Z @given( 2025-05-07T20:33:01.1611643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1611743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1611852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1611964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1612076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1612143Z ) 2025-05-07T20:33:01.1612380Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1612519Z def test_silu_mul_quant( 2025-05-07T20:33:01.1612592Z self, 2025-05-07T20:33:01.1612667Z T: int, 2025-05-07T20:33:01.1612746Z D: int, 2025-05-07T20:33:01.1612843Z scale_ub: Optional[float], 2025-05-07T20:33:01.1612929Z contiguous: bool, 2025-05-07T20:33:01.1613015Z compiled: bool, 2025-05-07T20:33:01.1613088Z ) -> None: 2025-05-07T20:33:01.1613184Z torch.manual_seed(2025) 2025-05-07T20:33:01.1613256Z 2025-05-07T20:33:01.1613417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1613491Z 2025-05-07T20:33:01.1613580Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1613699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1613788Z x = x_sign * x_clamp 2025-05-07T20:33:01.1613862Z x0 = x[:, :D] 2025-05-07T20:33:01.1613937Z x1 = x[:, D:] 2025-05-07T20:33:01.1614012Z 2025-05-07T20:33:01.1614091Z if contiguous: 2025-05-07T20:33:01.1614181Z x0 = x0.contiguous() 2025-05-07T20:33:01.1614274Z x1 = x1.contiguous() 2025-05-07T20:33:01.1614341Z 2025-05-07T20:33:01.1614438Z if scale_ub is not None: 2025-05-07T20:33:01.1614546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1614676Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1614757Z ) 2025-05-07T20:33:01.1614831Z else: 2025-05-07T20:33:01.1614921Z scale_ub_tensor = None 2025-05-07T20:33:01.1614999Z 2025-05-07T20:33:01.1615124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1615210Z op = silu_mul_quant 2025-05-07T20:33:01.1615298Z if compiled: 2025-05-07T20:33:01.1615394Z op = torch.compile(op) 2025-05-07T20:33:01.1615495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1615567Z 2025-05-07T20:33:01.1615651Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1615661Z 2025-05-07T20:33:01.1615760Z moe/activation_test.py:117: 2025-05-07T20:33:01.1615884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1616064Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1616167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1616531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1616620Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1617111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1617204Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1617562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1617783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1618122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1618218Z kernel = self.compile( 2025-05-07T20:33:01.1618620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1618830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1618959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1618963Z 2025-05-07T20:33:01.1619158Z self = 2025-05-07T20:33:01.1619925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1620415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788f9a980>} 2025-05-07T20:33:01.1621200Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1621386Z context = 2025-05-07T20:33:01.1621390Z 2025-05-07T20:33:01.1621547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1621808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1621912Z module_map=module_map) 2025-05-07T20:33:01.1622072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1622167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1622243Z E ^ 2025-05-07T20:33:01.1622597Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:01.1623127Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.1623347Z     self=,
2025-05-07T20:33:01.1623422Z     T=4096,
2025-05-07T20:33:01.1623499Z     D=5120,
2025-05-07T20:33:01.1623580Z     scale_ub=None,
2025-05-07T20:33:01.1623662Z     contiguous=False,
2025-05-07T20:33:01.1623745Z     compiled=True,
2025-05-07T20:33:01.1623816Z )
2025-05-07T20:33:01.1624025Z self =
2025-05-07T20:33:01.1624199Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:01.1624204Z 
2025-05-07T20:33:01.1624276Z     @given(
2025-05-07T20:33:01.1624398Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:01.1624499Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:01.1624728Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:01.1624850Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:01.1624958Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:01.1625027Z     )
2025-05-07T20:33:01.1625266Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:01.1625355Z     def test_silu_mul_quant(
2025-05-07T20:33:01.1625427Z         self,
2025-05-07T20:33:01.1625506Z         T: int,
2025-05-07T20:33:01.1625580Z         D: int,
2025-05-07T20:33:01.1625672Z         scale_ub: Optional[float],
2025-05-07T20:33:01.1625763Z         contiguous: bool,
2025-05-07T20:33:01.1625844Z         compiled: bool,
2025-05-07T20:33:01.1625918Z     ) -> None:
2025-05-07T20:33:01.1626013Z         torch.manual_seed(2025)
2025-05-07T20:33:01.1626083Z 
2025-05-07T20:33:01.1626252Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:01.1626322Z 
2025-05-07T20:33:01.1626418Z         x_sign = torch.sign(x)
2025-05-07T20:33:01.1626590Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:01.1626676Z         x = x_sign * x_clamp
2025-05-07T20:33:01.1626752Z         x0 = x[:, :D]
2025-05-07T20:33:01.1626833Z         x1 = x[:, D:]
2025-05-07T20:33:01.1626907Z 
2025-05-07T20:33:01.1626987Z         if contiguous:
2025-05-07T20:33:01.1627079Z             x0 = x0.contiguous()
2025-05-07T20:33:01.1627165Z             x1 = x1.contiguous()
2025-05-07T20:33:01.1627236Z 
2025-05-07T20:33:01.1627327Z         if scale_ub is not None:
2025-05-07T20:33:01.1627556Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:01.1627708Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:01.1627782Z             )
2025-05-07T20:33:01.1627856Z         else:
2025-05-07T20:33:01.1628011Z             scale_ub_tensor = None
2025-05-07T20:33:01.1628082Z 
2025-05-07T20:33:01.1628211Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.1628313Z             op = silu_mul_quant
2025-05-07T20:33:01.1628400Z             if compiled:
2025-05-07T20:33:01.1628493Z                 op = torch.compile(op)
2025-05-07T20:33:01.1628602Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1628670Z 
2025-05-07T20:33:01.1628759Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:01.1628763Z 
2025-05-07T20:33:01.1628861Z moe/activation_test.py:117:
2025-05-07T20:33:01.1628985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:01.1629094Z moe/activation_test.py:115: in fn
2025-05-07T20:33:01.1629188Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1629550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:01.1629650Z     return fn(*args, **kwargs)
2025-05-07T20:33:01.1630141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:01.1630235Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:01.1630598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1630816Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1631159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1631248Z     kernel = self.compile(
2025-05-07T20:33:01.1631648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1631823Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1631950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:01.1631957Z 
2025-05-07T20:33:01.1632161Z self =
2025-05-07T20:33:01.1633018Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.1633516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37885a3920>}
2025-05-07T20:33:01.1634257Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:01.1634444Z context =
2025-05-07T20:33:01.1634451Z 
2025-05-07T20:33:01.1634617Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.1634879Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.1635021Z                            module_map=module_map)
2025-05-07T20:33:01.1635185Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1635279Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:01.1635361Z E   ^
2025-05-07T20:33:01.1635710Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1636119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.1636123Z 
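The root cause is architectural: Triton's fp8e4nv type (PyTorch's float8_e4m3fn) is implemented by the NVIDIA backend only for compute capability 8.9 and newer (Ada/Hopper); older parts such as SM 8.6 Ampere GPUs (e.g. an A10G) expose only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A quick probe, as a sketch assuming a CUDA build of PyTorch:

    # Capability probe sketch; assumes torch with a CUDA device available.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: sm_{major}{minor}")

    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+; below that Triton
    # only offers fp8e4b15 / fp8e5, so this kernel cannot compile.
    if (major, minor) < (8, 9):
        print("fp8e4nv kernels will fail to compile on this device")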
Each of the eleven further examples below failed identically: the same test body, the same call path (moe/activation_test.py:117 -> silu_mul_quant -> triton jit.py:330/623 -> compiler.py:273 -> compiler.py:100), and the same error, CompilationError wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the Hypothesis example parameters differ:
2025-05-07T20:33:01.1636231Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:01.1654277Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:01.1667927Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:01.1680314Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:01.1692839Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:01.1705739Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:01.1718330Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:01.1731195Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:01.1744797Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:01.1757247Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:01.1775001Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1787630Z 2025-05-07T20:33:01.1788048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1788052Z 2025-05-07T20:33:01.1788151Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1788371Z self=, 2025-05-07T20:33:01.1788454Z T=2048, 2025-05-07T20:33:01.1788529Z D=5120, 2025-05-07T20:33:01.1788615Z scale_ub=None, 2025-05-07T20:33:01.1788697Z contiguous=False, 2025-05-07T20:33:01.1788775Z compiled=True, 2025-05-07T20:33:01.1788850Z ) 2025-05-07T20:33:01.1789069Z self = 2025-05-07T20:33:01.1789290Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.1789295Z 2025-05-07T20:33:01.1789375Z @given( 2025-05-07T20:33:01.1789490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1789587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1789704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1789818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1789935Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1790008Z ) 2025-05-07T20:33:01.1790244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1790340Z def test_silu_mul_quant( 2025-05-07T20:33:01.1790458Z self, 2025-05-07T20:33:01.1790533Z T: int, 2025-05-07T20:33:01.1790614Z D: int, 2025-05-07T20:33:01.1790711Z scale_ub: Optional[float], 2025-05-07T20:33:01.1790799Z contiguous: bool, 2025-05-07T20:33:01.1790887Z compiled: bool, 2025-05-07T20:33:01.1790963Z ) -> None: 2025-05-07T20:33:01.1791055Z torch.manual_seed(2025) 2025-05-07T20:33:01.1791129Z 2025-05-07T20:33:01.1791290Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1791370Z 2025-05-07T20:33:01.1791458Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1791576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1791667Z x = x_sign * x_clamp 2025-05-07T20:33:01.1791743Z x0 = x[:, :D] 2025-05-07T20:33:01.1791819Z x1 = x[:, D:] 2025-05-07T20:33:01.1791896Z 2025-05-07T20:33:01.1791976Z if contiguous: 2025-05-07T20:33:01.1792064Z x0 = x0.contiguous() 2025-05-07T20:33:01.1792162Z x1 = x1.contiguous() 2025-05-07T20:33:01.1792229Z 2025-05-07T20:33:01.1792316Z if scale_ub is not None: 2025-05-07T20:33:01.1792432Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1792564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1792636Z ) 2025-05-07T20:33:01.1792712Z else: 2025-05-07T20:33:01.1792799Z scale_ub_tensor = None 2025-05-07T20:33:01.1792873Z 2025-05-07T20:33:01.1792998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1793083Z op = silu_mul_quant 2025-05-07T20:33:01.1793169Z if compiled: 2025-05-07T20:33:01.1793263Z op = torch.compile(op) 2025-05-07T20:33:01.1793364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1793442Z 2025-05-07T20:33:01.1793529Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1793534Z 2025-05-07T20:33:01.1793627Z moe/activation_test.py:117: 2025-05-07T20:33:01.1793757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1793938Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1794038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1794404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1794493Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1794987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1795080Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1795432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1795656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1795990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1796087Z kernel = self.compile( 2025-05-07T20:33:01.1796489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1796699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1796829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1796833Z 2025-05-07T20:33:01.1797030Z self = 2025-05-07T20:33:01.1797797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1798287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f07c0>} 2025-05-07T20:33:01.1799066Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1799261Z context = 2025-05-07T20:33:01.1799265Z 2025-05-07T20:33:01.1799422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1799681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1799784Z module_map=module_map) 2025-05-07T20:33:01.1799940Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1800038Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1800112Z E ^ 2025-05-07T20:33:01.1800458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1800471Z 2025-05-07T20:33:01.1800886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1800893Z 2025-05-07T20:33:01.1800994Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1801214Z self=, 2025-05-07T20:33:01.1801292Z T=2048, 2025-05-07T20:33:01.1801363Z D=5120, 2025-05-07T20:33:01.1801448Z scale_ub=1200.0, 2025-05-07T20:33:01.1801530Z contiguous=False, 2025-05-07T20:33:01.1801608Z compiled=True, 2025-05-07T20:33:01.1801681Z ) 2025-05-07T20:33:01.1801892Z self = 2025-05-07T20:33:01.1802068Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1802073Z 2025-05-07T20:33:01.1802147Z @given( 2025-05-07T20:33:01.1802262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1802367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1802562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1802675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1802793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1802865Z ) 2025-05-07T20:33:01.1803109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1803196Z def test_silu_mul_quant( 2025-05-07T20:33:01.1803270Z self, 2025-05-07T20:33:01.1803350Z T: int, 2025-05-07T20:33:01.1803423Z D: int, 2025-05-07T20:33:01.1803515Z scale_ub: Optional[float], 2025-05-07T20:33:01.1803610Z contiguous: bool, 2025-05-07T20:33:01.1803690Z compiled: bool, 2025-05-07T20:33:01.1803765Z ) -> None: 2025-05-07T20:33:01.1803862Z torch.manual_seed(2025) 2025-05-07T20:33:01.1803931Z 2025-05-07T20:33:01.1804098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1804177Z 2025-05-07T20:33:01.1804265Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1804392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1804526Z x = x_sign * x_clamp 2025-05-07T20:33:01.1804603Z x0 = x[:, :D] 2025-05-07T20:33:01.1804684Z x1 = x[:, D:] 2025-05-07T20:33:01.1804754Z 2025-05-07T20:33:01.1804832Z if contiguous: 2025-05-07T20:33:01.1804925Z x0 = x0.contiguous() 2025-05-07T20:33:01.1805010Z x1 = x1.contiguous() 2025-05-07T20:33:01.1805079Z 2025-05-07T20:33:01.1805178Z if scale_ub is not None: 2025-05-07T20:33:01.1805280Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1805410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1805492Z ) 2025-05-07T20:33:01.1805566Z else: 2025-05-07T20:33:01.1805656Z scale_ub_tensor = None 2025-05-07T20:33:01.1805775Z 2025-05-07T20:33:01.1805901Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1805998Z op = silu_mul_quant 2025-05-07T20:33:01.1806080Z if compiled: 2025-05-07T20:33:01.1806179Z op = torch.compile(op) 2025-05-07T20:33:01.1806288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1806356Z 2025-05-07T20:33:01.1806444Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1806448Z 2025-05-07T20:33:01.1806549Z moe/activation_test.py:117: 2025-05-07T20:33:01.1806673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1806767Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1806866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1807231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1807326Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1807814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1807911Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1808273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1808492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1808824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1808920Z kernel = self.compile( 2025-05-07T20:33:01.1809316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1809491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1809612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1809619Z 2025-05-07T20:33:01.1809814Z self = 2025-05-07T20:33:01.1810667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1811162Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f1580>} 2025-05-07T20:33:01.1811902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1812086Z context = 2025-05-07T20:33:01.1812090Z 2025-05-07T20:33:01.1812256Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1812510Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1812618Z module_map=module_map) 2025-05-07T20:33:01.1812825Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1812919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1812993Z E ^ 2025-05-07T20:33:01.1813346Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1813351Z 2025-05-07T20:33:01.1813783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1813788Z 2025-05-07T20:33:01.1813890Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1814105Z self=, 2025-05-07T20:33:01.1814257Z T=4096, 2025-05-07T20:33:01.1814339Z D=5120, 2025-05-07T20:33:01.1814418Z scale_ub=1200.0, 2025-05-07T20:33:01.1814498Z contiguous=True, 2025-05-07T20:33:01.1814584Z compiled=True, 2025-05-07T20:33:01.1814651Z ) 2025-05-07T20:33:01.1814866Z self = 2025-05-07T20:33:01.1815037Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1815042Z 2025-05-07T20:33:01.1815115Z @given( 2025-05-07T20:33:01.1815235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1815329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1815441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1815557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1815667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1815739Z ) 2025-05-07T20:33:01.1815984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1816076Z def test_silu_mul_quant( 2025-05-07T20:33:01.1816155Z self, 2025-05-07T20:33:01.1816229Z T: int, 2025-05-07T20:33:01.1816319Z D: int, 2025-05-07T20:33:01.1816422Z scale_ub: Optional[float], 2025-05-07T20:33:01.1816506Z contiguous: bool, 2025-05-07T20:33:01.1816586Z compiled: bool, 2025-05-07T20:33:01.1816668Z ) -> None: 2025-05-07T20:33:01.1816759Z torch.manual_seed(2025) 2025-05-07T20:33:01.1816829Z 2025-05-07T20:33:01.1816996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1817068Z 2025-05-07T20:33:01.1817157Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1817281Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1817367Z x = x_sign * x_clamp 2025-05-07T20:33:01.1817447Z x0 = x[:, :D] 2025-05-07T20:33:01.1817528Z x1 = x[:, D:] 2025-05-07T20:33:01.1817596Z 2025-05-07T20:33:01.1817689Z if contiguous: 2025-05-07T20:33:01.1817779Z x0 = x0.contiguous() 2025-05-07T20:33:01.1817863Z x1 = x1.contiguous() 2025-05-07T20:33:01.1818026Z 2025-05-07T20:33:01.1818121Z if scale_ub is not None: 2025-05-07T20:33:01.1818234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1818384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1818460Z ) 2025-05-07T20:33:01.1818539Z else: 2025-05-07T20:33:01.1818641Z scale_ub_tensor = None 2025-05-07T20:33:01.1818714Z 2025-05-07T20:33:01.1818850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1818950Z op = silu_mul_quant 2025-05-07T20:33:01.1819036Z if compiled: 2025-05-07T20:33:01.1819143Z op = torch.compile(op) 2025-05-07T20:33:01.1819254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1819327Z 2025-05-07T20:33:01.1819426Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1819433Z 2025-05-07T20:33:01.1819533Z moe/activation_test.py:117: 2025-05-07T20:33:01.1819677Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1819912Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1820006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1820370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1820466Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1820951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1821049Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1821401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1821618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1822000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1822096Z kernel = self.compile( 2025-05-07T20:33:01.1822501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1822671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1822793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1822797Z 2025-05-07T20:33:01.1822999Z self = 2025-05-07T20:33:01.1823757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1824254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f2840>} 2025-05-07T20:33:01.1824992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1825178Z context = 2025-05-07T20:33:01.1825182Z 2025-05-07T20:33:01.1825345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1825600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1825708Z module_map=module_map) 2025-05-07T20:33:01.1825864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1825957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1826037Z E ^ 2025-05-07T20:33:01.1826385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1826390Z 2025-05-07T20:33:01.1826950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1826957Z 2025-05-07T20:33:01.1827055Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1827271Z self=, 2025-05-07T20:33:01.1827347Z T=128, 2025-05-07T20:33:01.1827540Z D=5120, 2025-05-07T20:33:01.1827649Z scale_ub=1200.0, 2025-05-07T20:33:01.1827767Z contiguous=False, 2025-05-07T20:33:01.1827879Z compiled=True, 2025-05-07T20:33:01.1827960Z ) 2025-05-07T20:33:01.1828180Z self = 2025-05-07T20:33:01.1828346Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1828350Z 2025-05-07T20:33:01.1828428Z @given( 2025-05-07T20:33:01.1828542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1828639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1828760Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1828926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1829033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1829109Z ) 2025-05-07T20:33:01.1829355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1829442Z def test_silu_mul_quant( 2025-05-07T20:33:01.1829520Z self, 2025-05-07T20:33:01.1829592Z T: int, 2025-05-07T20:33:01.1829670Z D: int, 2025-05-07T20:33:01.1829762Z scale_ub: Optional[float], 2025-05-07T20:33:01.1829848Z contiguous: bool, 2025-05-07T20:33:01.1829937Z compiled: bool, 2025-05-07T20:33:01.1830011Z ) -> None: 2025-05-07T20:33:01.1830101Z torch.manual_seed(2025) 2025-05-07T20:33:01.1830226Z 2025-05-07T20:33:01.1830388Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1830459Z 2025-05-07T20:33:01.1830556Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1830678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1830764Z x = x_sign * x_clamp 2025-05-07T20:33:01.1830850Z x0 = x[:, :D] 2025-05-07T20:33:01.1830926Z x1 = x[:, D:] 2025-05-07T20:33:01.1830993Z 2025-05-07T20:33:01.1831077Z if contiguous: 2025-05-07T20:33:01.1831168Z x0 = x0.contiguous() 2025-05-07T20:33:01.1831259Z x1 = x1.contiguous() 2025-05-07T20:33:01.1831335Z 2025-05-07T20:33:01.1831421Z if scale_ub is not None: 2025-05-07T20:33:01.1831535Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1831663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1831736Z ) 2025-05-07T20:33:01.1831818Z else: 2025-05-07T20:33:01.1831907Z scale_ub_tensor = None 2025-05-07T20:33:01.1831977Z 2025-05-07T20:33:01.1832114Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1832200Z op = silu_mul_quant 2025-05-07T20:33:01.1832282Z if compiled: 2025-05-07T20:33:01.1832385Z op = torch.compile(op) 2025-05-07T20:33:01.1832489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1832564Z 2025-05-07T20:33:01.1832653Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1832657Z 2025-05-07T20:33:01.1832750Z moe/activation_test.py:117: 2025-05-07T20:33:01.1832881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1832978Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1833074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1833448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1833539Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1834120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1834218Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1834572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1834798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1835137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1835231Z kernel = self.compile( 2025-05-07T20:33:01.1835619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1835790Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1835924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1835929Z 2025-05-07T20:33:01.1836131Z self = 2025-05-07T20:33:01.1836895Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1837449Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f34c0>} 2025-05-07T20:33:01.1838184Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1838384Z context = 2025-05-07T20:33:01.1838426Z 2025-05-07T20:33:01.1838588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1838860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1838965Z module_map=module_map) 2025-05-07T20:33:01.1839122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1839220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1839295Z E ^ 2025-05-07T20:33:01.1839644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1839649Z 2025-05-07T20:33:01.1840540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1840550Z 2025-05-07T20:33:01.1840697Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1840925Z self=, 2025-05-07T20:33:01.1841005Z T=16384, 2025-05-07T20:33:01.1841081Z D=7168, 2025-05-07T20:33:01.1841169Z scale_ub=1200.0, 2025-05-07T20:33:01.1841256Z contiguous=True, 2025-05-07T20:33:01.1841337Z compiled=True, 2025-05-07T20:33:01.1841413Z ) 2025-05-07T20:33:01.1841627Z self = 2025-05-07T20:33:01.1841798Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1841803Z 2025-05-07T20:33:01.1841882Z @given( 2025-05-07T20:33:01.1841996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1842096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1842214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1842326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1842441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1842515Z ) 2025-05-07T20:33:01.1842756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1842850Z def test_silu_mul_quant( 2025-05-07T20:33:01.1843167Z self, 2025-05-07T20:33:01.1843245Z T: int, 2025-05-07T20:33:01.1843329Z D: int, 2025-05-07T20:33:01.1843424Z scale_ub: Optional[float], 2025-05-07T20:33:01.1843513Z contiguous: bool, 2025-05-07T20:33:01.1843594Z compiled: bool, 2025-05-07T20:33:01.1843669Z ) -> None: 2025-05-07T20:33:01.1843767Z torch.manual_seed(2025) 2025-05-07T20:33:01.1843837Z 2025-05-07T20:33:01.1844000Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1844077Z 2025-05-07T20:33:01.1844165Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1844285Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1844376Z x = x_sign * x_clamp 2025-05-07T20:33:01.1844455Z x0 = x[:, :D] 2025-05-07T20:33:01.1844535Z x1 = x[:, D:] 2025-05-07T20:33:01.1844612Z 2025-05-07T20:33:01.1844693Z if contiguous: 2025-05-07T20:33:01.1844780Z x0 = x0.contiguous() 2025-05-07T20:33:01.1844879Z x1 = x1.contiguous() 2025-05-07T20:33:01.1845051Z 2025-05-07T20:33:01.1845141Z if scale_ub is not None: 2025-05-07T20:33:01.1845245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1845375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1845454Z ) 2025-05-07T20:33:01.1845528Z else: 2025-05-07T20:33:01.1845620Z scale_ub_tensor = None 2025-05-07T20:33:01.1845696Z 2025-05-07T20:33:01.1845822Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1845907Z op = silu_mul_quant 2025-05-07T20:33:01.1845994Z if compiled: 2025-05-07T20:33:01.1846091Z op = torch.compile(op) 2025-05-07T20:33:01.1846191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1846333Z 2025-05-07T20:33:01.1846422Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1846426Z 2025-05-07T20:33:01.1846528Z moe/activation_test.py:117: 2025-05-07T20:33:01.1846661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1846768Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1846870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1847236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1847326Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1847822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1847913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1848277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1848500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1848842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1848942Z kernel = self.compile( 2025-05-07T20:33:01.1849342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1849513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1849644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1849648Z 2025-05-07T20:33:01.1849847Z self = 2025-05-07T20:33:01.1850618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1851200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fcc20>} 2025-05-07T20:33:01.1851944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1852131Z context = 2025-05-07T20:33:01.1852136Z 2025-05-07T20:33:01.1852294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1852560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1852665Z module_map=module_map) 2025-05-07T20:33:01.1852828Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1852921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1853004Z E ^ 2025-05-07T20:33:01.1853362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1853367Z 2025-05-07T20:33:01.1853825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1853829Z 2025-05-07T20:33:01.1853928Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1854151Z self=, 2025-05-07T20:33:01.1854225Z T=16384, 2025-05-07T20:33:01.1854306Z D=5120, 2025-05-07T20:33:01.1854391Z scale_ub=1200.0, 2025-05-07T20:33:01.1854471Z contiguous=True, 2025-05-07T20:33:01.1854564Z compiled=False, 2025-05-07T20:33:01.1854634Z ) 2025-05-07T20:33:01.1854847Z self = 2025-05-07T20:33:01.1855026Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1855074Z 2025-05-07T20:33:01.1855148Z @given( 2025-05-07T20:33:01.1855269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1855369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1855482Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1855604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1855712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1855784Z ) 2025-05-07T20:33:01.1856028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1856117Z def test_silu_mul_quant( 2025-05-07T20:33:01.1856191Z self, 2025-05-07T20:33:01.1856272Z T: int, 2025-05-07T20:33:01.1856349Z D: int, 2025-05-07T20:33:01.1856444Z scale_ub: Optional[float], 2025-05-07T20:33:01.1856535Z contiguous: bool, 2025-05-07T20:33:01.1856627Z compiled: bool, 2025-05-07T20:33:01.1856707Z ) -> None: 2025-05-07T20:33:01.1856804Z torch.manual_seed(2025) 2025-05-07T20:33:01.1856874Z 2025-05-07T20:33:01.1857048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1857126Z 2025-05-07T20:33:01.1857245Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1857417Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1857530Z x = x_sign * x_clamp 2025-05-07T20:33:01.1857639Z x0 = x[:, :D] 2025-05-07T20:33:01.1857753Z x1 = x[:, D:] 2025-05-07T20:33:01.1857850Z 2025-05-07T20:33:01.1857963Z if contiguous: 2025-05-07T20:33:01.1858098Z x0 = x0.contiguous() 2025-05-07T20:33:01.1858215Z x1 = x1.contiguous() 2025-05-07T20:33:01.1858310Z 2025-05-07T20:33:01.1858433Z if scale_ub is not None: 2025-05-07T20:33:01.1858567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1858705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1858782Z ) 2025-05-07T20:33:01.1858857Z else: 2025-05-07T20:33:01.1858950Z scale_ub_tensor = None 2025-05-07T20:33:01.1859120Z 2025-05-07T20:33:01.1859257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1859350Z op = silu_mul_quant 2025-05-07T20:33:01.1859430Z if compiled: 2025-05-07T20:33:01.1859526Z op = torch.compile(op) 2025-05-07T20:33:01.1859634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1859704Z 2025-05-07T20:33:01.1859790Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1859800Z 2025-05-07T20:33:01.1859893Z moe/activation_test.py:117: 2025-05-07T20:33:01.1860017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1860119Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1860214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1860709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.1860811Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1861174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1861435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1861779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1861871Z kernel = self.compile( 2025-05-07T20:33:01.1862275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1862444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1862569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1862574Z 2025-05-07T20:33:01.1862822Z self = 2025-05-07T20:33:01.1863593Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1864098Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fd580>} 2025-05-07T20:33:01.1864841Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1865033Z context = 2025-05-07T20:33:01.1865039Z 2025-05-07T20:33:01.1865205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1865464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1865582Z module_map=module_map) 2025-05-07T20:33:01.1865745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1865841Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1865922Z E ^ 2025-05-07T20:33:01.1866268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1866272Z 2025-05-07T20:33:01.1866708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1866712Z 2025-05-07T20:33:01.1866809Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1867028Z self=, 2025-05-07T20:33:01.1867111Z T=1, 2025-05-07T20:33:01.1867185Z D=7168, 2025-05-07T20:33:01.1867270Z scale_ub=1200.0, 2025-05-07T20:33:01.1867358Z contiguous=False, 2025-05-07T20:33:01.1867552Z compiled=False, 2025-05-07T20:33:01.1867634Z ) 2025-05-07T20:33:01.1867941Z self = 2025-05-07T20:33:01.1868147Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1868154Z 2025-05-07T20:33:01.1868268Z @given( 2025-05-07T20:33:01.1868424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1868551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1868713Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1868868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1869011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1869111Z ) 2025-05-07T20:33:01.1869430Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1869570Z def test_silu_mul_quant( 2025-05-07T20:33:01.1869677Z self, 2025-05-07T20:33:01.1869780Z T: int, 2025-05-07T20:33:01.1869860Z D: int, 2025-05-07T20:33:01.1869960Z scale_ub: Optional[float], 2025-05-07T20:33:01.1872793Z contiguous: bool, 2025-05-07T20:33:01.1872901Z compiled: bool, 2025-05-07T20:33:01.1872989Z ) -> None: 2025-05-07T20:33:01.1873083Z torch.manual_seed(2025) 2025-05-07T20:33:01.1873161Z 2025-05-07T20:33:01.1873328Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1873401Z 2025-05-07T20:33:01.1873497Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1873624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1873714Z x = x_sign * x_clamp 2025-05-07T20:33:01.1873797Z x0 = x[:, :D] 2025-05-07T20:33:01.1873873Z x1 = x[:, D:] 2025-05-07T20:33:01.1873945Z 2025-05-07T20:33:01.1874028Z if contiguous: 2025-05-07T20:33:01.1874181Z x0 = x0.contiguous() 2025-05-07T20:33:01.1874267Z x1 = x1.contiguous() 2025-05-07T20:33:01.1874343Z 2025-05-07T20:33:01.1874432Z if scale_ub is not None: 2025-05-07T20:33:01.1874536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1874699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1874774Z ) 2025-05-07T20:33:01.1874848Z else: 2025-05-07T20:33:01.1874944Z scale_ub_tensor = None 2025-05-07T20:33:01.1875016Z 2025-05-07T20:33:01.1875146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1875234Z op = silu_mul_quant 2025-05-07T20:33:01.1875317Z if compiled: 2025-05-07T20:33:01.1875420Z op = torch.compile(op) 2025-05-07T20:33:01.1875521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1875592Z 2025-05-07T20:33:01.1875685Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1875690Z 2025-05-07T20:33:01.1875788Z moe/activation_test.py:117: 2025-05-07T20:33:01.1875914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1876019Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1876121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1876621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1876715Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1877063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1877286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1877622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1877714Z kernel = self.compile( 2025-05-07T20:33:01.1878101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1878278Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1878493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1878501Z 2025-05-07T20:33:01.1878699Z self = 2025-05-07T20:33:01.1879461Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1879959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fe8e0>} 2025-05-07T20:33:01.1880690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1880889Z context = 2025-05-07T20:33:01.1880935Z 2025-05-07T20:33:01.1881186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1881448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1881554Z module_map=module_map) 2025-05-07T20:33:01.1881713Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1881813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1881889Z E ^ 2025-05-07T20:33:01.1882236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1882241Z 2025-05-07T20:33:01.1882657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1882702Z 2025-05-07T20:33:01.1882803Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1883032Z self=, 2025-05-07T20:33:01.1883115Z T=4096, 2025-05-07T20:33:01.1883194Z D=7168, 2025-05-07T20:33:01.1883286Z scale_ub=1200.0, 2025-05-07T20:33:01.1883370Z contiguous=False, 2025-05-07T20:33:01.1883450Z compiled=True, 2025-05-07T20:33:01.1883529Z ) 2025-05-07T20:33:01.1883743Z self = 2025-05-07T20:33:01.1883914Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1883926Z 2025-05-07T20:33:01.1884002Z @given( 2025-05-07T20:33:01.1884116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1884219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1884329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1884446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1884567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1884643Z ) 2025-05-07T20:33:01.1884886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1884986Z def test_silu_mul_quant( 2025-05-07T20:33:01.1885062Z self, 2025-05-07T20:33:01.1885137Z T: int, 2025-05-07T20:33:01.1885217Z D: int, 2025-05-07T20:33:01.1885313Z scale_ub: Optional[float], 2025-05-07T20:33:01.1885408Z contiguous: bool, 2025-05-07T20:33:01.1885490Z compiled: bool, 2025-05-07T20:33:01.1885566Z ) -> None: 2025-05-07T20:33:01.1885662Z torch.manual_seed(2025) 2025-05-07T20:33:01.1885730Z 2025-05-07T20:33:01.1885895Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1885973Z 2025-05-07T20:33:01.1886064Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1886186Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1886275Z x = x_sign * x_clamp 2025-05-07T20:33:01.1886352Z x0 = x[:, :D] 2025-05-07T20:33:01.1886478Z x1 = x[:, D:] 2025-05-07T20:33:01.1886561Z 2025-05-07T20:33:01.1886646Z if contiguous: 2025-05-07T20:33:01.1886733Z x0 = x0.contiguous() 2025-05-07T20:33:01.1886823Z x1 = x1.contiguous() 2025-05-07T20:33:01.1886893Z 2025-05-07T20:33:01.1886986Z if scale_ub is not None: 2025-05-07T20:33:01.1887088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1887221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1887303Z ) 2025-05-07T20:33:01.1887374Z else: 2025-05-07T20:33:01.1887464Z scale_ub_tensor = None 2025-05-07T20:33:01.1887540Z 2025-05-07T20:33:01.1887666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1887752Z op = silu_mul_quant 2025-05-07T20:33:01.1887845Z if compiled: 2025-05-07T20:33:01.1887942Z op = torch.compile(op) 2025-05-07T20:33:01.1888048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1888169Z 2025-05-07T20:33:01.1888316Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1888321Z 2025-05-07T20:33:01.1888422Z moe/activation_test.py:117: 2025-05-07T20:33:01.1888547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1888646Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1888750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1889115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1889206Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1889698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1889833Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1890192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1890414Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1890753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1890847Z kernel = self.compile( 2025-05-07T20:33:01.1891224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1891403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1891525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1891530Z 2025-05-07T20:33:01.1891727Z self = 2025-05-07T20:33:01.1892495Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1892993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76ffa60>} 2025-05-07T20:33:01.1893729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1893913Z context = 2025-05-07T20:33:01.1893917Z 2025-05-07T20:33:01.1894076Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1894336Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1894442Z module_map=module_map) 2025-05-07T20:33:01.1894605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1894747Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1894828Z E ^ 2025-05-07T20:33:01.1895182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1895187Z 2025-05-07T20:33:01.1895597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1895601Z 2025-05-07T20:33:01.1895704Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1895920Z self=, 2025-05-07T20:33:01.1895997Z T=128, 2025-05-07T20:33:01.1900972Z D=7168, 2025-05-07T20:33:01.1901071Z scale_ub=1200.0, 2025-05-07T20:33:01.1901154Z contiguous=False, 2025-05-07T20:33:01.1901253Z compiled=True, 2025-05-07T20:33:01.1901325Z ) 2025-05-07T20:33:01.1901544Z self = 2025-05-07T20:33:01.1901731Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1901869Z 2025-05-07T20:33:01.1901948Z @given( 2025-05-07T20:33:01.1902066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1902171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1902281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1902398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1902506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1902574Z ) 2025-05-07T20:33:01.1902820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1902912Z def test_silu_mul_quant( 2025-05-07T20:33:01.1902987Z self, 2025-05-07T20:33:01.1903067Z T: int, 2025-05-07T20:33:01.1903190Z D: int, 2025-05-07T20:33:01.1903286Z scale_ub: Optional[float], 2025-05-07T20:33:01.1903380Z contiguous: bool, 2025-05-07T20:33:01.1903466Z compiled: bool, 2025-05-07T20:33:01.1903552Z ) -> None: 2025-05-07T20:33:01.1903652Z torch.manual_seed(2025) 2025-05-07T20:33:01.1903726Z 2025-05-07T20:33:01.1903899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1903971Z 2025-05-07T20:33:01.1904064Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1904194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1904284Z x = x_sign * x_clamp 2025-05-07T20:33:01.1904365Z x0 = x[:, :D] 2025-05-07T20:33:01.1904450Z x1 = x[:, D:] 2025-05-07T20:33:01.1904522Z 2025-05-07T20:33:01.1904605Z if contiguous: 2025-05-07T20:33:01.1904698Z x0 = x0.contiguous() 2025-05-07T20:33:01.1904785Z x1 = x1.contiguous() 2025-05-07T20:33:01.1904861Z 2025-05-07T20:33:01.1904955Z if scale_ub is not None: 2025-05-07T20:33:01.1905058Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1905198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1905278Z ) 2025-05-07T20:33:01.1905357Z else: 2025-05-07T20:33:01.1905455Z scale_ub_tensor = None 2025-05-07T20:33:01.1905526Z 2025-05-07T20:33:01.1905654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1905751Z op = silu_mul_quant 2025-05-07T20:33:01.1905833Z if compiled: 2025-05-07T20:33:01.1905932Z op = torch.compile(op) 2025-05-07T20:33:01.1906042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1906111Z 2025-05-07T20:33:01.1906204Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1906214Z 2025-05-07T20:33:01.1906309Z moe/activation_test.py:117: 2025-05-07T20:33:01.1906436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1906543Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1906641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1907067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1907168Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1907805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1907899Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1908261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1908480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1908821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1908913Z kernel = self.compile( 2025-05-07T20:33:01.1909297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1909480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1909701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1909707Z 2025-05-07T20:33:01.1909914Z self = 2025-05-07T20:33:01.1910680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1911172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d4ea0>} 2025-05-07T20:33:01.1911916Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1912179Z context = 2025-05-07T20:33:01.1912187Z 2025-05-07T20:33:01.1912355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1912611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1912717Z module_map=module_map) 2025-05-07T20:33:01.1912880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1912974Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1913053Z E ^ 2025-05-07T20:33:01.1913403Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1913408Z 2025-05-07T20:33:01.1913826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1913831Z 2025-05-07T20:33:01.1913938Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1914161Z self=, 2025-05-07T20:33:01.1914244Z T=2048, 2025-05-07T20:33:01.1914322Z D=7168, 2025-05-07T20:33:01.1914401Z scale_ub=None, 2025-05-07T20:33:01.1914490Z contiguous=True, 2025-05-07T20:33:01.1914572Z compiled=True, 2025-05-07T20:33:01.1914646Z ) 2025-05-07T20:33:01.1914864Z self = 2025-05-07T20:33:01.1915032Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1915037Z 2025-05-07T20:33:01.1915114Z @given( 2025-05-07T20:33:01.1915237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1915335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1915449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1915571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1915725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1915809Z ) 2025-05-07T20:33:01.1916053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1916144Z def test_silu_mul_quant( 2025-05-07T20:33:01.1916228Z self, 2025-05-07T20:33:01.1916305Z T: int, 2025-05-07T20:33:01.1916379Z D: int, 2025-05-07T20:33:01.1916479Z scale_ub: Optional[float], 2025-05-07T20:33:01.1916569Z contiguous: bool, 2025-05-07T20:33:01.1916653Z compiled: bool, 2025-05-07T20:33:01.1916737Z ) -> None: 2025-05-07T20:33:01.1916829Z torch.manual_seed(2025) 2025-05-07T20:33:01.1916901Z 2025-05-07T20:33:01.1917073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1917149Z 2025-05-07T20:33:01.1917244Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1917366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1917458Z x = x_sign * x_clamp 2025-05-07T20:33:01.1917549Z x0 = x[:, :D] 2025-05-07T20:33:01.1917717Z x1 = x[:, D:] 2025-05-07T20:33:01.1917790Z 2025-05-07T20:33:01.1917879Z if contiguous: 2025-05-07T20:33:01.1917968Z x0 = x0.contiguous() 2025-05-07T20:33:01.1918055Z x1 = x1.contiguous() 2025-05-07T20:33:01.1918134Z 2025-05-07T20:33:01.1918221Z if scale_ub is not None: 2025-05-07T20:33:01.1918327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1918465Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1918542Z ) 2025-05-07T20:33:01.1918623Z else: 2025-05-07T20:33:01.1918717Z scale_ub_tensor = None 2025-05-07T20:33:01.1918790Z 2025-05-07T20:33:01.1918923Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1919056Z op = silu_mul_quant 2025-05-07T20:33:01.1919141Z if compiled: 2025-05-07T20:33:01.1919248Z op = torch.compile(op) 2025-05-07T20:33:01.1919352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1919429Z 2025-05-07T20:33:01.1919524Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1919529Z 2025-05-07T20:33:01.1919622Z moe/activation_test.py:117: 2025-05-07T20:33:01.1919754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1919850Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1919947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1920319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1920409Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1920893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1920998Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1921355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1921582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1921918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1922010Z kernel = self.compile( 2025-05-07T20:33:01.1922393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1922566Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1922689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1922693Z 2025-05-07T20:33:01.1922897Z self = 2025-05-07T20:33:01.1923709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1924214Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d5c60>} 2025-05-07T20:33:01.1924945Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1925140Z context = 2025-05-07T20:33:01.1925144Z 2025-05-07T20:33:01.1925304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1925560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1925676Z module_map=module_map) 2025-05-07T20:33:01.1925837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1926017Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1926102Z E ^ 2025-05-07T20:33:01.1926450Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1926454Z 2025-05-07T20:33:01.1926894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1926898Z 2025-05-07T20:33:01.1926998Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1927216Z self=, 2025-05-07T20:33:01.1927299Z T=16384, 2025-05-07T20:33:01.1927372Z D=5120, 2025-05-07T20:33:01.1927452Z scale_ub=None, 2025-05-07T20:33:01.1927583Z contiguous=False, 2025-05-07T20:33:01.1927664Z compiled=False, 2025-05-07T20:33:01.1927742Z ) 2025-05-07T20:33:01.1927959Z self = 2025-05-07T20:33:01.1928138Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1928143Z 2025-05-07T20:33:01.1928222Z @given( 2025-05-07T20:33:01.1928336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1928431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1928547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1928660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1928777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1928849Z ) 2025-05-07T20:33:01.1929088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1929180Z def test_silu_mul_quant( 2025-05-07T20:33:01.1929259Z self, 2025-05-07T20:33:01.1929333Z T: int, 2025-05-07T20:33:01.1929412Z D: int, 2025-05-07T20:33:01.1929507Z scale_ub: Optional[float], 2025-05-07T20:33:01.1929597Z contiguous: bool, 2025-05-07T20:33:01.1929687Z compiled: bool, 2025-05-07T20:33:01.1929766Z ) -> None: 2025-05-07T20:33:01.1929859Z torch.manual_seed(2025) 2025-05-07T20:33:01.1929937Z 2025-05-07T20:33:01.1930102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1930176Z 2025-05-07T20:33:01.1930270Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1930393Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1932286Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1932299Z 2025-05-07T20:33:01.1932416Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.1932421Z 2025-05-07T20:33:01.1932526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1932746Z self=, 2025-05-07T20:33:01.1932826Z T=4096, 2025-05-07T20:33:01.1932909Z D=7168, 2025-05-07T20:33:01.1932989Z scale_ub=1200.0, 2025-05-07T20:33:01.1933071Z contiguous=True, 2025-05-07T20:33:01.1933158Z compiled=True, 2025-05-07T20:33:01.1933228Z ) 2025-05-07T20:33:01.1933442Z self = 2025-05-07T20:33:01.1933613Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1933620Z 2025-05-07T20:33:01.1933694Z @given( 2025-05-07T20:33:01.1933814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1933913Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1934112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1934234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1934347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1934422Z ) 2025-05-07T20:33:01.1934669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1934758Z def test_silu_mul_quant( 2025-05-07T20:33:01.1934842Z self, 2025-05-07T20:33:01.1934917Z T: int, 2025-05-07T20:33:01.1934993Z D: int, 2025-05-07T20:33:01.1935097Z scale_ub: Optional[float], 2025-05-07T20:33:01.1935188Z contiguous: bool, 2025-05-07T20:33:01.1935269Z compiled: bool, 2025-05-07T20:33:01.1935395Z ) -> None: 2025-05-07T20:33:01.1935487Z torch.manual_seed(2025) 2025-05-07T20:33:01.1935560Z 2025-05-07T20:33:01.1935734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1935808Z 2025-05-07T20:33:01.1935900Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1936027Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1937790Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
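The allocator hint in the message above is an environment setting, not a code change. A minimal sketch of applying it (assumption: the variable must be exported before the process makes its first CUDA allocation, so in a fresh run, before torch touches the GPU):

    # Sketch: apply the allocator hint quoted in the OOM message above.
    # PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator starts up,
    # so it has to be set before anything allocates on the device.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the variable is set so it takes effect

    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

In CI the same effect is usually achieved by exporting the variable in the job environment before pytest starts.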
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1937798Z 2025-05-07T20:33:01.1937918Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.1937922Z 2025-05-07T20:33:01.1938022Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1938248Z self=, 2025-05-07T20:33:01.1938326Z T=16384, 2025-05-07T20:33:01.1938401Z D=7168, 2025-05-07T20:33:01.1938490Z scale_ub=None, 2025-05-07T20:33:01.1938573Z contiguous=False, 2025-05-07T20:33:01.1938655Z compiled=False, 2025-05-07T20:33:01.1938731Z ) 2025-05-07T20:33:01.1938942Z self = 2025-05-07T20:33:01.1939113Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1939117Z 2025-05-07T20:33:01.1939196Z @given( 2025-05-07T20:33:01.1939310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1939411Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1939521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1939637Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1939798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1939871Z ) 2025-05-07T20:33:01.1940649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1940796Z def test_silu_mul_quant( 2025-05-07T20:33:01.1940874Z self, 2025-05-07T20:33:01.1940949Z T: int, 2025-05-07T20:33:01.1941030Z D: int, 2025-05-07T20:33:01.1941124Z scale_ub: Optional[float], 2025-05-07T20:33:01.1941216Z contiguous: bool, 2025-05-07T20:33:01.1941300Z compiled: bool, 2025-05-07T20:33:01.1941377Z ) -> None: 2025-05-07T20:33:01.1941476Z torch.manual_seed(2025) 2025-05-07T20:33:01.1941547Z 2025-05-07T20:33:01.1941710Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1943647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1943743Z 2025-05-07T20:33:01.1943859Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.1943864Z 2025-05-07T20:33:01.1943970Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1944186Z self=, 2025-05-07T20:33:01.1944260Z T=2048, 2025-05-07T20:33:01.1944341Z D=7168, 2025-05-07T20:33:01.1944425Z scale_ub=1200.0, 2025-05-07T20:33:01.1944510Z contiguous=True, 2025-05-07T20:33:01.1944666Z compiled=True, 2025-05-07T20:33:01.1944739Z ) 2025-05-07T20:33:01.1944956Z self = 2025-05-07T20:33:01.1945128Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1945135Z 2025-05-07T20:33:01.1945210Z @given( 2025-05-07T20:33:01.1945332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1945427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1945538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1945659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1945769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1945844Z ) 2025-05-07T20:33:01.1946083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1946174Z def test_silu_mul_quant( 2025-05-07T20:33:01.1946259Z self, 2025-05-07T20:33:01.1946338Z T: int, 2025-05-07T20:33:01.1946413Z D: int, 2025-05-07T20:33:01.1946516Z scale_ub: Optional[float], 2025-05-07T20:33:01.1946607Z contiguous: bool, 2025-05-07T20:33:01.1946689Z compiled: bool, 2025-05-07T20:33:01.1946777Z ) -> None: 2025-05-07T20:33:01.1946869Z torch.manual_seed(2025) 2025-05-07T20:33:01.1946938Z 2025-05-07T20:33:01.1947107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1947182Z 2025-05-07T20:33:01.1947272Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1947463Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1949283Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1949300Z 2025-05-07T20:33:01.1949416Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.1949421Z 2025-05-07T20:33:01.1949521Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1949747Z self=, 2025-05-07T20:33:01.1949822Z T=2048, 2025-05-07T20:33:01.1949898Z D=7168, 2025-05-07T20:33:01.1949984Z scale_ub=None, 2025-05-07T20:33:01.1950066Z contiguous=True, 2025-05-07T20:33:01.1950151Z compiled=False, 2025-05-07T20:33:01.1950230Z ) 2025-05-07T20:33:01.1950440Z self = 2025-05-07T20:33:01.1950612Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.1950618Z 2025-05-07T20:33:01.1950692Z @given( 2025-05-07T20:33:01.1950804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1950906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1951108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1951222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1951340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1951413Z ) 2025-05-07T20:33:01.1951651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1951747Z def test_silu_mul_quant( 2025-05-07T20:33:01.1951822Z self, 2025-05-07T20:33:01.1951906Z T: int, 2025-05-07T20:33:01.1951977Z D: int, 2025-05-07T20:33:01.1952070Z scale_ub: Optional[float], 2025-05-07T20:33:01.1952159Z contiguous: bool, 2025-05-07T20:33:01.1952241Z compiled: bool, 2025-05-07T20:33:01.1952317Z ) -> None: 2025-05-07T20:33:01.1952455Z torch.manual_seed(2025) 2025-05-07T20:33:01.1952525Z 2025-05-07T20:33:01.1952687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1952769Z 2025-05-07T20:33:01.1952860Z > x_sign = torch.sign(x) 2025-05-07T20:33:01.1954606Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
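Across the successive Trying example runs above, free memory stays near zero (140.44 MiB, then 28.44 MiB, then 26.44 MiB free on a 22.07 GiB card), so each new example inherits whatever the previous ones left cached. One hedged way to reset between examples (_release_cuda_memory is an invented helper, not part of activation_test.py):

    # Hypothetical per-example cleanup; not what the test currently does.
    import gc
    import torch

    def _release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors first
        torch.cuda.synchronize()  # let pending kernels finish before freeing
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver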
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1954612Z 2025-05-07T20:33:01.1954725Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:01.1954732Z 2025-05-07T20:33:01.1954829Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1955050Z self=, 2025-05-07T20:33:01.1955127Z T=1, 2025-05-07T20:33:01.1955203Z D=7168, 2025-05-07T20:33:01.1955285Z scale_ub=1200.0, 2025-05-07T20:33:01.1955367Z contiguous=True, 2025-05-07T20:33:01.1955455Z compiled=False, 2025-05-07T20:33:01.1955524Z ) 2025-05-07T20:33:01.1955735Z self = 2025-05-07T20:33:01.1955901Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1955905Z 2025-05-07T20:33:01.1955977Z @given( 2025-05-07T20:33:01.1956089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1956190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1956300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1956419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1956529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1956602Z ) 2025-05-07T20:33:01.1956981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1957078Z def test_silu_mul_quant( 2025-05-07T20:33:01.1957152Z self, 2025-05-07T20:33:01.1957228Z T: int, 2025-05-07T20:33:01.1957300Z D: int, 2025-05-07T20:33:01.1957392Z scale_ub: Optional[float], 2025-05-07T20:33:01.1957478Z contiguous: bool, 2025-05-07T20:33:01.1957559Z compiled: bool, 2025-05-07T20:33:01.1957631Z ) -> None: 2025-05-07T20:33:01.1957725Z torch.manual_seed(2025) 2025-05-07T20:33:01.1957793Z 2025-05-07T20:33:01.1957955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1958024Z 2025-05-07T20:33:01.1958113Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1958241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1958329Z x = x_sign * x_clamp 2025-05-07T20:33:01.1958404Z x0 = x[:, :D] 2025-05-07T20:33:01.1958485Z x1 = x[:, D:] 2025-05-07T20:33:01.1958553Z 2025-05-07T20:33:01.1958636Z if contiguous: 2025-05-07T20:33:01.1958815Z x0 = x0.contiguous() 2025-05-07T20:33:01.1958905Z x1 = x1.contiguous() 2025-05-07T20:33:01.1958975Z 2025-05-07T20:33:01.1959070Z if scale_ub is not None: 2025-05-07T20:33:01.1959174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1959310Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1959384Z ) 2025-05-07T20:33:01.1959461Z else: 2025-05-07T20:33:01.1959561Z scale_ub_tensor = None 2025-05-07T20:33:01.1959646Z 2025-05-07T20:33:01.1959773Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1959869Z op = silu_mul_quant 2025-05-07T20:33:01.1959951Z if compiled: 2025-05-07T20:33:01.1960585Z op = torch.compile(op) 2025-05-07T20:33:01.1960696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1960772Z 2025-05-07T20:33:01.1960864Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1960869Z 2025-05-07T20:33:01.1960974Z moe/activation_test.py:117: 2025-05-07T20:33:01.1961100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1961202Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1961299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1961847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1961946Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1962308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1962524Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1962875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1962973Z kernel = self.compile( 2025-05-07T20:33:01.1963362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1963535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1963657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1963662Z 2025-05-07T20:33:01.1963873Z self = 2025-05-07T20:33:01.1964636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1965136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7504b80>} 2025-05-07T20:33:01.1965920Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1966115Z context = 2025-05-07T20:33:01.1966120Z 2025-05-07T20:33:01.1966281Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1966538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1966650Z module_map=module_map) 2025-05-07T20:33:01.1966809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1966902Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1966986Z E ^ 2025-05-07T20:33:01.1967336Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1967341Z 2025-05-07T20:33:01.1967813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1967854Z 2025-05-07T20:33:01.1967955Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1968173Z self=, 2025-05-07T20:33:01.1968259Z T=128, 2025-05-07T20:33:01.1968332Z D=5120, 2025-05-07T20:33:01.1968411Z scale_ub=None, 2025-05-07T20:33:01.1968498Z contiguous=True, 2025-05-07T20:33:01.1968578Z compiled=False, 2025-05-07T20:33:01.1968649Z ) 2025-05-07T20:33:01.1968868Z self = 2025-05-07T20:33:01.1969036Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.1969041Z 2025-05-07T20:33:01.1969156Z @given( 2025-05-07T20:33:01.1969270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1969369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1969486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1969602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1969713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1969791Z ) 2025-05-07T20:33:01.1970030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1970126Z def test_silu_mul_quant( 2025-05-07T20:33:01.1970205Z self, 2025-05-07T20:33:01.1970279Z T: int, 2025-05-07T20:33:01.1970359Z D: int, 2025-05-07T20:33:01.1970451Z scale_ub: Optional[float], 2025-05-07T20:33:01.1970537Z contiguous: bool, 2025-05-07T20:33:01.1970627Z compiled: bool, 2025-05-07T20:33:01.1970702Z ) -> None: 2025-05-07T20:33:01.1970793Z torch.manual_seed(2025) 2025-05-07T20:33:01.1970875Z 2025-05-07T20:33:01.1971037Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1971107Z 2025-05-07T20:33:01.1971211Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1971378Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1971513Z x = x_sign * x_clamp 2025-05-07T20:33:01.1971645Z x0 = x[:, :D] 2025-05-07T20:33:01.1971737Z x1 = x[:, D:] 2025-05-07T20:33:01.1971827Z 2025-05-07T20:33:01.1971907Z if contiguous: 2025-05-07T20:33:01.1971994Z x0 = x0.contiguous() 2025-05-07T20:33:01.1972122Z x1 = x1.contiguous() 2025-05-07T20:33:01.1972221Z 2025-05-07T20:33:01.1972328Z if scale_ub is not None: 2025-05-07T20:33:01.1972438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1972570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1972645Z ) 2025-05-07T20:33:01.1972726Z else: 2025-05-07T20:33:01.1972818Z scale_ub_tensor = None 2025-05-07T20:33:01.1972892Z 2025-05-07T20:33:01.1973100Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1973188Z op = silu_mul_quant 2025-05-07T20:33:01.1973281Z if compiled: 2025-05-07T20:33:01.1973377Z op = torch.compile(op) 2025-05-07T20:33:01.1973478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1973554Z 2025-05-07T20:33:01.1973645Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1973650Z 2025-05-07T20:33:01.1973743Z moe/activation_test.py:117: 2025-05-07T20:33:01.1973875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1973973Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1974068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1974566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1974664Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1975033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1975382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1975721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1975820Z kernel = self.compile( 2025-05-07T20:33:01.1976199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1976376Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1976500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1976504Z 2025-05-07T20:33:01.1976702Z self = 2025-05-07T20:33:01.1977518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1978015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7505a80>} 2025-05-07T20:33:01.1978753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1978938Z context = 2025-05-07T20:33:01.1978943Z 2025-05-07T20:33:01.1979102Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1979367Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1979473Z module_map=module_map) 2025-05-07T20:33:01.1979640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1979737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1979816Z E ^ 2025-05-07T20:33:01.1980170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1980175Z 2025-05-07T20:33:01.1980585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1980590Z 2025-05-07T20:33:01.1980694Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1980911Z self=, 2025-05-07T20:33:01.1980988Z T=128, 2025-05-07T20:33:01.1981066Z D=7168, 2025-05-07T20:33:01.1981144Z scale_ub=None, 2025-05-07T20:33:01.1981228Z contiguous=True, 2025-05-07T20:33:01.1981318Z compiled=False, 2025-05-07T20:33:01.1981386Z ) 2025-05-07T20:33:01.1981643Z self = 2025-05-07T20:33:01.1981821Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.1981827Z 2025-05-07T20:33:01.1981898Z @given( 2025-05-07T20:33:01.1982025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1982120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1982234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1982351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1982461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1982530Z ) 2025-05-07T20:33:01.1982771Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1982861Z def test_silu_mul_quant( 2025-05-07T20:33:01.1982934Z self, 2025-05-07T20:33:01.1983014Z T: int, 2025-05-07T20:33:01.1983088Z D: int, 2025-05-07T20:33:01.1983180Z scale_ub: Optional[float], 2025-05-07T20:33:01.1983278Z contiguous: bool, 2025-05-07T20:33:01.1983359Z compiled: bool, 2025-05-07T20:33:01.1983522Z ) -> None: 2025-05-07T20:33:01.1983619Z torch.manual_seed(2025) 2025-05-07T20:33:01.1983686Z 2025-05-07T20:33:01.1983854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1983929Z 2025-05-07T20:33:01.1984016Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1984142Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1984227Z x = x_sign * x_clamp 2025-05-07T20:33:01.1984304Z x0 = x[:, :D] 2025-05-07T20:33:01.1984385Z x1 = x[:, D:] 2025-05-07T20:33:01.1984454Z 2025-05-07T20:33:01.1984533Z if contiguous: 2025-05-07T20:33:01.1984628Z x0 = x0.contiguous() 2025-05-07T20:33:01.1984713Z x1 = x1.contiguous() 2025-05-07T20:33:01.1984831Z 2025-05-07T20:33:01.1984919Z if scale_ub is not None: 2025-05-07T20:33:01.1985022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1985166Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1985245Z ) 2025-05-07T20:33:01.1985318Z else: 2025-05-07T20:33:01.1985413Z scale_ub_tensor = None 2025-05-07T20:33:01.1985483Z 2025-05-07T20:33:01.1985610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1985700Z op = silu_mul_quant 2025-05-07T20:33:01.1985780Z if compiled: 2025-05-07T20:33:01.1985875Z op = torch.compile(op) 2025-05-07T20:33:01.1985982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1986053Z 2025-05-07T20:33:01.1986141Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1986151Z 2025-05-07T20:33:01.1986244Z moe/activation_test.py:117: 2025-05-07T20:33:01.1986368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1986474Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1986571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1987060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1987162Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1987641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1987863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1988200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1988293Z kernel = self.compile( 2025-05-07T20:33:01.1988680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1988855Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1989034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1989039Z 2025-05-07T20:33:01.1989247Z self = 2025-05-07T20:33:01.1990008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1990506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7506980>} 2025-05-07T20:33:01.1991237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1991434Z context = 2025-05-07T20:33:01.1991439Z 2025-05-07T20:33:01.1991626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1991992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1992105Z module_map=module_map) 2025-05-07T20:33:01.1992263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1992356Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1992439Z E ^ 2025-05-07T20:33:01.1992787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1992791Z 2025-05-07T20:33:01.1993208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1993253Z 2025-05-07T20:33:01.1993354Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1993572Z self=, 2025-05-07T20:33:01.1993659Z T=2048, 2025-05-07T20:33:01.1993738Z D=7168, 2025-05-07T20:33:01.1993821Z scale_ub=1200.0, 2025-05-07T20:33:01.1993911Z contiguous=True, 2025-05-07T20:33:01.1993993Z compiled=False, 2025-05-07T20:33:01.1994070Z ) 2025-05-07T20:33:01.1994284Z self = 2025-05-07T20:33:01.1994455Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1994460Z 2025-05-07T20:33:01.1994543Z @given( 2025-05-07T20:33:01.1994657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1994756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1994873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1994987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1995108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1995180Z ) 2025-05-07T20:33:01.1995421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1995523Z def test_silu_mul_quant( 2025-05-07T20:33:01.1995596Z self, 2025-05-07T20:33:01.1995671Z T: int, 2025-05-07T20:33:01.1995752Z D: int, 2025-05-07T20:33:01.1995846Z scale_ub: Optional[float], 2025-05-07T20:33:01.1995936Z contiguous: bool, 2025-05-07T20:33:01.1996023Z compiled: bool, 2025-05-07T20:33:01.1996099Z ) -> None: 2025-05-07T20:33:01.1996192Z torch.manual_seed(2025) 2025-05-07T20:33:01.1996270Z 2025-05-07T20:33:01.1996432Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1998246Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1998258Z 2025-05-07T20:33:01.1998373Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.1998378Z 2025-05-07T20:33:01.1998482Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1998701Z self=, 2025-05-07T20:33:01.1998774Z T=1, 2025-05-07T20:33:01.1998848Z D=5120, 2025-05-07T20:33:01.1998925Z scale_ub=1200.0, 2025-05-07T20:33:01.1999004Z contiguous=True, 2025-05-07T20:33:01.1999088Z compiled=False, 2025-05-07T20:33:01.1999158Z ) 2025-05-07T20:33:01.1999370Z self = 2025-05-07T20:33:01.1999539Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1999546Z 2025-05-07T20:33:01.1999618Z @given( 2025-05-07T20:33:01.1999825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1999920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2000030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2000148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2000258Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2000333Z ) 2025-05-07T20:33:01.2000574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2000664Z def test_silu_mul_quant( 2025-05-07T20:33:01.2000736Z self, 2025-05-07T20:33:01.2000814Z T: int, 2025-05-07T20:33:01.2000889Z D: int, 2025-05-07T20:33:01.2000991Z scale_ub: Optional[float], 2025-05-07T20:33:01.2001142Z contiguous: bool, 2025-05-07T20:33:01.2001224Z compiled: bool, 2025-05-07T20:33:01.2001307Z ) -> None: 2025-05-07T20:33:01.2001403Z torch.manual_seed(2025) 2025-05-07T20:33:01.2001484Z 2025-05-07T20:33:01.2001682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2001768Z 2025-05-07T20:33:01.2001857Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2001985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2002071Z x = x_sign * x_clamp 2025-05-07T20:33:01.2002148Z x0 = x[:, :D] 2025-05-07T20:33:01.2002230Z x1 = x[:, D:] 2025-05-07T20:33:01.2002304Z 2025-05-07T20:33:01.2002384Z if contiguous: 2025-05-07T20:33:01.2002475Z x0 = x0.contiguous() 2025-05-07T20:33:01.2002559Z x1 = x1.contiguous() 2025-05-07T20:33:01.2002638Z 2025-05-07T20:33:01.2002726Z if scale_ub is not None: 2025-05-07T20:33:01.2002830Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2002968Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2003044Z ) 2025-05-07T20:33:01.2003122Z else: 2025-05-07T20:33:01.2003226Z scale_ub_tensor = None 2025-05-07T20:33:01.2003296Z 2025-05-07T20:33:01.2003422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2003519Z op = silu_mul_quant 2025-05-07T20:33:01.2003601Z if compiled: 2025-05-07T20:33:01.2003698Z op = torch.compile(op) 2025-05-07T20:33:01.2003807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2003878Z 2025-05-07T20:33:01.2003971Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.2003976Z 2025-05-07T20:33:01.2004071Z moe/activation_test.py:117: 2025-05-07T20:33:01.2004197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2004299Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.2004398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2004968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.2005073Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.2005429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2005656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2005989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2006082Z kernel = self.compile( 2025-05-07T20:33:01.2006485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2006654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2006782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2006792Z 2025-05-07T20:33:01.2006994Z self = 2025-05-07T20:33:01.2007801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2008337Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7507e20>} 2025-05-07T20:33:01.2009069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2009263Z context = 2025-05-07T20:33:01.2009305Z 2025-05-07T20:33:01.2009465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2009725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2009842Z module_map=module_map) 2025-05-07T20:33:01.2010001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2010102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2010179Z E ^ 2025-05-07T20:33:01.2010527Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2010532Z 2025-05-07T20:33:01.2010946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2010950Z 2025-05-07T20:33:01.2011049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2011265Z self=, 2025-05-07T20:33:01.2011352Z T=2048, 2025-05-07T20:33:01.2011427Z D=5120, 2025-05-07T20:33:01.2011510Z scale_ub=None, 2025-05-07T20:33:01.2011594Z contiguous=True, 2025-05-07T20:33:01.2011676Z compiled=False, 2025-05-07T20:33:01.2011754Z ) 2025-05-07T20:33:01.2011967Z self = 2025-05-07T20:33:01.2012137Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2012142Z 2025-05-07T20:33:01.2012223Z @given( 2025-05-07T20:33:01.2012337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2012436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2012557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2012670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2012782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2012854Z ) 2025-05-07T20:33:01.2013095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2013192Z def test_silu_mul_quant( 2025-05-07T20:33:01.2013314Z self, 2025-05-07T20:33:01.2013390Z T: int, 2025-05-07T20:33:01.2013479Z D: int, 2025-05-07T20:33:01.2013575Z scale_ub: Optional[float], 2025-05-07T20:33:01.2013665Z contiguous: bool, 2025-05-07T20:33:01.2013753Z compiled: bool, 2025-05-07T20:33:01.2013829Z ) -> None: 2025-05-07T20:33:01.2013922Z torch.manual_seed(2025) 2025-05-07T20:33:01.2014001Z 2025-05-07T20:33:01.2014165Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2014243Z 2025-05-07T20:33:01.2014331Z > x_sign = torch.sign(x) 2025-05-07T20:33:01.2016153Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2016206Z 2025-05-07T20:33:01.2016322Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:01.2016327Z 2025-05-07T20:33:01.2016425Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2016649Z self=, 2025-05-07T20:33:01.2016724Z T=16384, 2025-05-07T20:33:01.2016797Z D=5120, 2025-05-07T20:33:01.2016882Z scale_ub=None, 2025-05-07T20:33:01.2016964Z contiguous=True, 2025-05-07T20:33:01.2017044Z compiled=False, 2025-05-07T20:33:01.2017122Z ) 2025-05-07T20:33:01.2017334Z self = 2025-05-07T20:33:01.2017553Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2017557Z 2025-05-07T20:33:01.2017633Z @given( 2025-05-07T20:33:01.2017749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2017852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2017963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2018077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2018191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2018264Z ) 2025-05-07T20:33:01.2018503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2018599Z def test_silu_mul_quant( 2025-05-07T20:33:01.2018671Z self, 2025-05-07T20:33:01.2018750Z T: int, 2025-05-07T20:33:01.2018824Z D: int, 2025-05-07T20:33:01.2018917Z scale_ub: Optional[float], 2025-05-07T20:33:01.2019012Z contiguous: bool, 2025-05-07T20:33:01.2019093Z compiled: bool, 2025-05-07T20:33:01.2019167Z ) -> None: 2025-05-07T20:33:01.2019263Z torch.manual_seed(2025) 2025-05-07T20:33:01.2019335Z 2025-05-07T20:33:01.2019502Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2021265Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
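The requested sizes are consistent with the test's first allocation: a [T, 2 * D] bfloat16 tensor costs T * 2D * 2 bytes. Worked through for the example above:

    # Size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16) for T=16384, D=5120
    T, D = 16384, 5120
    bytes_needed = T * (2 * D) * 2   # 2 bytes per bfloat16 element
    print(bytes_needed / 2**20)      # 320.0, matching "Tried to allocate 320.00 MiB"

The other reported sizes (448.00 MiB, 112.00 MiB, 80.00 MiB, 56.00 MiB, 40.00 MiB) follow the same formula for the remaining (T, D) pairs in the sampled grid.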
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2021272Z 2025-05-07T20:33:01.2021386Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2021394Z 2025-05-07T20:33:01.2021498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2021758Z self=, 2025-05-07T20:33:01.2021839Z T=4096, 2025-05-07T20:33:01.2026547Z D=5120, 2025-05-07T20:33:01.2026641Z scale_ub=None, 2025-05-07T20:33:01.2026734Z contiguous=True, 2025-05-07T20:33:01.2026819Z compiled=False, 2025-05-07T20:33:01.2026891Z ) 2025-05-07T20:33:01.2027114Z self = 2025-05-07T20:33:01.2027283Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2027289Z 2025-05-07T20:33:01.2027364Z @given( 2025-05-07T20:33:01.2027632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2027733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2027844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2027964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2028078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2028158Z ) 2025-05-07T20:33:01.2028403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2028623Z def test_silu_mul_quant( 2025-05-07T20:33:01.2028705Z self, 2025-05-07T20:33:01.2028781Z T: int, 2025-05-07T20:33:01.2028854Z D: int, 2025-05-07T20:33:01.2028956Z scale_ub: Optional[float], 2025-05-07T20:33:01.2029041Z contiguous: bool, 2025-05-07T20:33:01.2029126Z compiled: bool, 2025-05-07T20:33:01.2029210Z ) -> None: 2025-05-07T20:33:01.2029300Z torch.manual_seed(2025) 2025-05-07T20:33:01.2029372Z 2025-05-07T20:33:01.2029544Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2031307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2031367Z 2025-05-07T20:33:01.2031486Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2031491Z 2025-05-07T20:33:01.2031613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2031868Z self=, 2025-05-07T20:33:01.2031939Z T=2048, 2025-05-07T20:33:01.2032013Z D=5120, 2025-05-07T20:33:01.2032108Z scale_ub=None, 2025-05-07T20:33:01.2032192Z contiguous=False, 2025-05-07T20:33:01.2032270Z compiled=False, 2025-05-07T20:33:01.2032351Z ) 2025-05-07T20:33:01.2032565Z self = 2025-05-07T20:33:01.2032741Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.2032746Z 2025-05-07T20:33:01.2032824Z @given( 2025-05-07T20:33:01.2032942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2033044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2033155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2033267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2033379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2033451Z ) 2025-05-07T20:33:01.2033692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2033790Z def test_silu_mul_quant( 2025-05-07T20:33:01.2033864Z self, 2025-05-07T20:33:01.2033944Z T: int, 2025-05-07T20:33:01.2034020Z D: int, 2025-05-07T20:33:01.2034119Z scale_ub: Optional[float], 2025-05-07T20:33:01.2034210Z contiguous: bool, 2025-05-07T20:33:01.2034292Z compiled: bool, 2025-05-07T20:33:01.2034414Z ) -> None: 2025-05-07T20:33:01.2034513Z torch.manual_seed(2025) 2025-05-07T20:33:01.2034590Z 2025-05-07T20:33:01.2034753Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2036507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2036515Z 2025-05-07T20:33:01.2036629Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2036633Z 2025-05-07T20:33:01.2036738Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2037007Z self=, 2025-05-07T20:33:01.2037126Z T=4096, 2025-05-07T20:33:01.2037200Z D=7168, 2025-05-07T20:33:01.2037278Z scale_ub=None, 2025-05-07T20:33:01.2037369Z contiguous=True, 2025-05-07T20:33:01.2037450Z compiled=True, 2025-05-07T20:33:01.2037520Z ) 2025-05-07T20:33:01.2037741Z self = 2025-05-07T20:33:01.2037905Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.2037910Z 2025-05-07T20:33:01.2037985Z @given( 2025-05-07T20:33:01.2038105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2038200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2038309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2038508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2038623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2038706Z ) 2025-05-07T20:33:01.2038952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2039043Z def test_silu_mul_quant( 2025-05-07T20:33:01.2039124Z self, 2025-05-07T20:33:01.2039198Z T: int, 2025-05-07T20:33:01.2039273Z D: int, 2025-05-07T20:33:01.2039374Z scale_ub: Optional[float], 2025-05-07T20:33:01.2039459Z contiguous: bool, 2025-05-07T20:33:01.2039542Z compiled: bool, 2025-05-07T20:33:01.2039623Z ) -> None: 2025-05-07T20:33:01.2039713Z torch.manual_seed(2025) 2025-05-07T20:33:01.2039783Z 2025-05-07T20:33:01.2039953Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2042057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
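If the goal were merely to keep the property-based search alive on a 22 GiB card, one option is to treat OOM as an invalid example rather than a failure. This is a sketch of an alternative, not what activation_test.py does, and it would also mask genuine regressions in peak memory:

    # Hedged sketch: discard OOM examples instead of failing the test.
    from hypothesis import assume
    import torch

    try:
        y_fp8, y_scale = fn()
    except torch.OutOfMemoryError:
        assume(False)  # Hypothesis rejects the example and samples another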
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2042076Z 2025-05-07T20:33:01.2042193Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2042198Z 2025-05-07T20:33:01.2042297Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2042523Z self=, 2025-05-07T20:33:01.2042600Z T=2048, 2025-05-07T20:33:01.2042673Z D=5120, 2025-05-07T20:33:01.2042758Z scale_ub=1200.0, 2025-05-07T20:33:01.2042839Z contiguous=False, 2025-05-07T20:33:01.2042924Z compiled=False, 2025-05-07T20:33:01.2043002Z ) 2025-05-07T20:33:01.2043220Z self = 2025-05-07T20:33:01.2043550Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.2043557Z 2025-05-07T20:33:01.2043634Z @given( 2025-05-07T20:33:01.2043751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2043855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2043968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2044085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2044201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2044274Z ) 2025-05-07T20:33:01.2044512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2044610Z def test_silu_mul_quant( 2025-05-07T20:33:01.2044685Z self, 2025-05-07T20:33:01.2044768Z T: int, 2025-05-07T20:33:01.2044844Z D: int, 2025-05-07T20:33:01.2044941Z scale_ub: Optional[float], 2025-05-07T20:33:01.2045039Z contiguous: bool, 2025-05-07T20:33:01.2045127Z compiled: bool, 2025-05-07T20:33:01.2045326Z ) -> None: 2025-05-07T20:33:01.2045429Z torch.manual_seed(2025) 2025-05-07T20:33:01.2045501Z 2025-05-07T20:33:01.2045673Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2047428Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2047498Z 2025-05-07T20:33:01.2047611Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2047618Z 2025-05-07T20:33:01.2047725Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2047947Z self=, 2025-05-07T20:33:01.2048031Z T=4096, 2025-05-07T20:33:01.2048104Z D=7168, 2025-05-07T20:33:01.2048186Z scale_ub=1200.0, 2025-05-07T20:33:01.2048272Z contiguous=True, 2025-05-07T20:33:01.2048354Z compiled=False, 2025-05-07T20:33:01.2048430Z ) 2025-05-07T20:33:01.2048650Z self = 2025-05-07T20:33:01.2048816Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.2048821Z 2025-05-07T20:33:01.2048896Z @given( 2025-05-07T20:33:01.2049014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2049111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2049220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2049340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2049449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2049536Z ) 2025-05-07T20:33:01.2049776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2049866Z def test_silu_mul_quant( 2025-05-07T20:33:01.2049950Z self, 2025-05-07T20:33:01.2050028Z T: int, 2025-05-07T20:33:01.2050101Z D: int, 2025-05-07T20:33:01.2050200Z scale_ub: Optional[float], 2025-05-07T20:33:01.2050286Z contiguous: bool, 2025-05-07T20:33:01.2050367Z compiled: bool, 2025-05-07T20:33:01.2050449Z ) -> None: 2025-05-07T20:33:01.2050541Z torch.manual_seed(2025) 2025-05-07T20:33:01.2050612Z 2025-05-07T20:33:01.2050779Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2052585Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2052599Z 2025-05-07T20:33:01.2052712Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2052716Z 2025-05-07T20:33:01.2052821Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2053047Z self=, 2025-05-07T20:33:01.2053126Z T=16384, 2025-05-07T20:33:01.2053204Z D=7168, 2025-05-07T20:33:01.2053293Z scale_ub=None, 2025-05-07T20:33:01.2053383Z contiguous=False, 2025-05-07T20:33:01.2053464Z compiled=True, 2025-05-07T20:33:01.2053544Z ) 2025-05-07T20:33:01.2053761Z self = 2025-05-07T20:33:01.2054024Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.2054029Z 2025-05-07T20:33:01.2054108Z @given( 2025-05-07T20:33:01.2054224Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2054326Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2054440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2054554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2054671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2054748Z ) 2025-05-07T20:33:01.2054988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2055085Z def test_silu_mul_quant( 2025-05-07T20:33:01.2055205Z self, 2025-05-07T20:33:01.2055285Z T: int, 2025-05-07T20:33:01.2055361Z D: int, 2025-05-07T20:33:01.2055464Z scale_ub: Optional[float], 2025-05-07T20:33:01.2055562Z contiguous: bool, 2025-05-07T20:33:01.2055650Z compiled: bool, 2025-05-07T20:33:01.2055725Z ) -> None: 2025-05-07T20:33:01.2055823Z torch.manual_seed(2025) 2025-05-07T20:33:01.2055897Z 2025-05-07T20:33:01.2056063Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2057828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2057836Z 2025-05-07T20:33:01.2057954Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2057961Z 2025-05-07T20:33:01.2058069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2058288Z self=, 2025-05-07T20:33:01.2058374Z T=4096, 2025-05-07T20:33:01.2058454Z D=7168, 2025-05-07T20:33:01.2058535Z scale_ub=None, 2025-05-07T20:33:01.2058625Z contiguous=True, 2025-05-07T20:33:01.2058710Z compiled=False, 2025-05-07T20:33:01.2058785Z ) 2025-05-07T20:33:01.2059006Z self = 2025-05-07T20:33:01.2059174Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2059178Z 2025-05-07T20:33:01.2059258Z @given( 2025-05-07T20:33:01.2059386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2059485Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2059643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2059768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2059886Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2059967Z ) 2025-05-07T20:33:01.2060211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2060304Z def test_silu_mul_quant( 2025-05-07T20:33:01.2060389Z self, 2025-05-07T20:33:01.2060467Z T: int, 2025-05-07T20:33:01.2060546Z D: int, 2025-05-07T20:33:01.2060652Z scale_ub: Optional[float], 2025-05-07T20:33:01.2060744Z contiguous: bool, 2025-05-07T20:33:01.2060831Z compiled: bool, 2025-05-07T20:33:01.2060922Z ) -> None: 2025-05-07T20:33:01.2061018Z torch.manual_seed(2025) 2025-05-07T20:33:01.2061092Z 2025-05-07T20:33:01.2061267Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2063214Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2063295Z 2025-05-07T20:33:01.2063416Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2063420Z 2025-05-07T20:33:01.2063521Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2063748Z self=, 2025-05-07T20:33:01.2063870Z T=16384, 2025-05-07T20:33:01.2063948Z D=7168, 2025-05-07T20:33:01.2064039Z scale_ub=None, 2025-05-07T20:33:01.2064123Z contiguous=True, 2025-05-07T20:33:01.2064212Z compiled=False, 2025-05-07T20:33:01.2064295Z ) 2025-05-07T20:33:01.2064515Z self = 2025-05-07T20:33:01.2064696Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2064701Z 2025-05-07T20:33:01.2064778Z @given( 2025-05-07T20:33:01.2064896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2064998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2065108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2065226Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2065351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2065429Z ) 2025-05-07T20:33:01.2065670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2065771Z def test_silu_mul_quant( 2025-05-07T20:33:01.2065847Z self, 2025-05-07T20:33:01.2065927Z T: int, 2025-05-07T20:33:01.2066004Z D: int, 2025-05-07T20:33:01.2066110Z scale_ub: Optional[float], 2025-05-07T20:33:01.2066204Z contiguous: bool, 2025-05-07T20:33:01.2066289Z compiled: bool, 2025-05-07T20:33:01.2066367Z ) -> None: 2025-05-07T20:33:01.2066467Z torch.manual_seed(2025) 2025-05-07T20:33:01.2066542Z 2025-05-07T20:33:01.2066714Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2068618Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2068630Z 2025-05-07T20:33:01.2068750Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2068755Z 2025-05-07T20:33:01.2068861Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2069081Z self=, 2025-05-07T20:33:01.2069167Z T=16384, 2025-05-07T20:33:01.2069247Z D=7168, 2025-05-07T20:33:01.2069329Z scale_ub=1200.0, 2025-05-07T20:33:01.2069423Z contiguous=True, 2025-05-07T20:33:01.2069507Z compiled=False, 2025-05-07T20:33:01.2069581Z ) 2025-05-07T20:33:01.2069802Z self = 2025-05-07T20:33:01.2069976Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.2069983Z 2025-05-07T20:33:01.2070068Z @given( 2025-05-07T20:33:01.2070190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2070290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2070453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2070650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2070760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2070839Z ) 2025-05-07T20:33:01.2071089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2071180Z def test_silu_mul_quant( 2025-05-07T20:33:01.2071260Z self, 2025-05-07T20:33:01.2071338Z T: int, 2025-05-07T20:33:01.2071412Z D: int, 2025-05-07T20:33:01.2071514Z scale_ub: Optional[float], 2025-05-07T20:33:01.2071603Z contiguous: bool, 2025-05-07T20:33:01.2071687Z compiled: bool, 2025-05-07T20:33:01.2071772Z ) -> None: 2025-05-07T20:33:01.2071929Z torch.manual_seed(2025) 2025-05-07T20:33:01.2072008Z 2025-05-07T20:33:01.2072203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2074115Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
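Every OutOfMemoryError above follows the same pattern: the very first allocation of a new Hypothesis example, a [T, 2*D] bf16 tensor (for T=16384, D=7168 that is 16384 x 14336 x 2 bytes, i.e. the 448.00 MiB in the message), fails because the 22.07 GiB A10G is already almost entirely occupied, with 21.73 GiB still allocated by PyTorch. The allocator's own suggestion in the message is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a minimal sketch of wiring that in (a hypothetical conftest.py, not a file from this repo) would be:

import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so it must be set before the first tensor is placed on the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

Note this only mitigates fragmentation of reserved-but-unallocated memory (19.12 MiB here); it cannot help when 21.73 GiB is genuinely live.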
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2074135Z 2025-05-07T20:33:01.2074293Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2074300Z 2025-05-07T20:33:01.2074434Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2074731Z self=, 2025-05-07T20:33:01.2074838Z T=128, 2025-05-07T20:33:01.2074943Z D=5120, 2025-05-07T20:33:01.2075073Z scale_ub=1200.0, 2025-05-07T20:33:01.2075172Z contiguous=False, 2025-05-07T20:33:01.2075261Z compiled=False, 2025-05-07T20:33:01.2075344Z ) 2025-05-07T20:33:01.2075558Z self = 2025-05-07T20:33:01.2075736Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.2075740Z 2025-05-07T20:33:01.2075817Z @given( 2025-05-07T20:33:01.2075938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2076040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2076152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2076267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2076386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2076463Z ) 2025-05-07T20:33:01.2076704Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2076806Z def test_silu_mul_quant( 2025-05-07T20:33:01.2076955Z self, 2025-05-07T20:33:01.2077047Z T: int, 2025-05-07T20:33:01.2077128Z D: int, 2025-05-07T20:33:01.2077224Z scale_ub: Optional[float], 2025-05-07T20:33:01.2077320Z contiguous: bool, 2025-05-07T20:33:01.2077405Z compiled: bool, 2025-05-07T20:33:01.2077480Z ) -> None: 2025-05-07T20:33:01.2077583Z torch.manual_seed(2025) 2025-05-07T20:33:01.2077655Z 2025-05-07T20:33:01.2077819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2077900Z 2025-05-07T20:33:01.2077995Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2078117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2078211Z x = x_sign * x_clamp 2025-05-07T20:33:01.2078291Z x0 = x[:, :D] 2025-05-07T20:33:01.2078374Z x1 = x[:, D:] 2025-05-07T20:33:01.2078454Z 2025-05-07T20:33:01.2078533Z if contiguous: 2025-05-07T20:33:01.2078633Z x0 = x0.contiguous() 2025-05-07T20:33:01.2078727Z x1 = x1.contiguous() 2025-05-07T20:33:01.2078890Z 2025-05-07T20:33:01.2078985Z if scale_ub is not None: 2025-05-07T20:33:01.2079090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2079224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2079304Z ) 2025-05-07T20:33:01.2079381Z else: 2025-05-07T20:33:01.2079473Z scale_ub_tensor = None 2025-05-07T20:33:01.2079550Z 2025-05-07T20:33:01.2079675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2079763Z op = silu_mul_quant 2025-05-07T20:33:01.2079853Z if compiled: 2025-05-07T20:33:01.2079950Z op = torch.compile(op) 2025-05-07T20:33:01.2080059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2080174Z 2025-05-07T20:33:01.2080263Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.2080267Z 2025-05-07T20:33:01.2080370Z moe/activation_test.py:117: 2025-05-07T20:33:01.2080499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2080601Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.2080702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2081203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.2081301Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.2081663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2081883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2082229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2082322Z kernel = self.compile( 2025-05-07T20:33:01.2082724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2082909Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2083033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2083038Z 2025-05-07T20:33:01.2083243Z self = 2025-05-07T20:33:01.2084012Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2084509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d72fcae0>} 2025-05-07T20:33:01.2085309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2085503Z context = 2025-05-07T20:33:01.2085508Z 2025-05-07T20:33:01.2085673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2085935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2086040Z module_map=module_map) 2025-05-07T20:33:01.2086205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2086302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2086383Z E ^ 2025-05-07T20:33:01.2086735Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2086743Z 2025-05-07T20:33:01.2087159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2087164Z 2025-05-07T20:33:01.2087348Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2087567Z self=, 2025-05-07T20:33:01.2087647Z T=2048, 2025-05-07T20:33:01.2087721Z D=7168, 2025-05-07T20:33:01.2087802Z scale_ub=None, 2025-05-07T20:33:01.2087897Z contiguous=False, 2025-05-07T20:33:01.2087981Z compiled=False, 2025-05-07T20:33:01.2088053Z ) 2025-05-07T20:33:01.2088273Z self = 2025-05-07T20:33:01.2088446Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.2088451Z 2025-05-07T20:33:01.2088527Z @given( 2025-05-07T20:33:01.2088650Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2088885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2088999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2089118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2089235Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2089328Z ) 2025-05-07T20:33:01.2089568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2089657Z def test_silu_mul_quant( 2025-05-07T20:33:01.2089747Z self, 2025-05-07T20:33:01.2089823Z T: int, 2025-05-07T20:33:01.2089900Z D: int, 2025-05-07T20:33:01.2090004Z scale_ub: Optional[float], 2025-05-07T20:33:01.2090092Z contiguous: bool, 2025-05-07T20:33:01.2090175Z compiled: bool, 2025-05-07T20:33:01.2090258Z ) -> None: 2025-05-07T20:33:01.2090349Z torch.manual_seed(2025) 2025-05-07T20:33:01.2090426Z 2025-05-07T20:33:01.2090590Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2092366Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
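This CompilationError is different in kind from the OOMs: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU and only offers 'fp8e4b15' and 'fp8e5'. The g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), while Triton's fp8e4nv kernels generally require capability 8.9+ (Ada/Hopper). A sketch of gating such tests on device capability follows; the helper name and the (8, 9) cutoff are assumptions based on the error above, not code from activation_test.py:

import unittest
import torch

def _has_fp8e4nv() -> bool:
    # Assumed cutoff: Triton exposes fp8e4nv from compute capability 8.9
    # (Ada/Hopper); the A10G here is 8.6, hence the ValueError in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Illustrative placement only; the real test class lives in activation_test.py.
@unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv not supported on this architecture")
class ActivationTests(unittest.TestCase):
    ...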
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2092380Z 2025-05-07T20:33:01.2092494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2092499Z 2025-05-07T20:33:01.2092598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2092825Z self=, 2025-05-07T20:33:01.2092901Z T=128, 2025-05-07T20:33:01.2092982Z D=7168, 2025-05-07T20:33:01.2093069Z scale_ub=1200.0, 2025-05-07T20:33:01.2093152Z contiguous=True, 2025-05-07T20:33:01.2093234Z compiled=True, 2025-05-07T20:33:01.2093358Z ) 2025-05-07T20:33:01.2093576Z self = 2025-05-07T20:33:01.2093749Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.2093753Z 2025-05-07T20:33:01.2093831Z @given( 2025-05-07T20:33:01.2093947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2094052Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2094164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2094281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2094398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2094472Z ) 2025-05-07T20:33:01.2094713Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2094814Z def test_silu_mul_quant( 2025-05-07T20:33:01.2094891Z self, 2025-05-07T20:33:01.2094976Z T: int, 2025-05-07T20:33:01.2095052Z D: int, 2025-05-07T20:33:01.2095149Z scale_ub: Optional[float], 2025-05-07T20:33:01.2095326Z contiguous: bool, 2025-05-07T20:33:01.2095412Z compiled: bool, 2025-05-07T20:33:01.2095495Z ) -> None: 2025-05-07T20:33:01.2095593Z torch.manual_seed(2025) 2025-05-07T20:33:01.2095665Z 2025-05-07T20:33:01.2095828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2095907Z 2025-05-07T20:33:01.2095996Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2096119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2096210Z x = x_sign * x_clamp 2025-05-07T20:33:01.2096290Z x0 = x[:, :D] 2025-05-07T20:33:01.2096374Z x1 = x[:, D:] 2025-05-07T20:33:01.2096445Z 2025-05-07T20:33:01.2096525Z if contiguous: 2025-05-07T20:33:01.2096661Z x0 = x0.contiguous() 2025-05-07T20:33:01.2096747Z x1 = x1.contiguous() 2025-05-07T20:33:01.2096818Z 2025-05-07T20:33:01.2096914Z if scale_ub is not None: 2025-05-07T20:33:01.2097022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2097162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2097245Z ) 2025-05-07T20:33:01.2097321Z else: 2025-05-07T20:33:01.2097412Z scale_ub_tensor = None 2025-05-07T20:33:01.2097490Z 2025-05-07T20:33:01.2097619Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2097712Z op = silu_mul_quant 2025-05-07T20:33:01.2097793Z if compiled: 2025-05-07T20:33:01.2097890Z op = torch.compile(op) 2025-05-07T20:33:01.2097999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2098071Z 2025-05-07T20:33:01.2098163Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.2098171Z 2025-05-07T20:33:01.2098272Z moe/activation_test.py:117: 2025-05-07T20:33:01.2098400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2098501Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.2098611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2098975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.2099072Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.2099562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.2099657Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.2100017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2100237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2100579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2100676Z kernel = self.compile( 2025-05-07T20:33:01.2101129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2101314Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2101439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2101443Z 2025-05-07T20:33:01.2101643Z self = 2025-05-07T20:33:01.2102417Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2102912Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7180040>} 2025-05-07T20:33:01.2103860Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2104163Z context = 2025-05-07T20:33:01.2104171Z 2025-05-07T20:33:01.2104359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2104628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2104734Z module_map=module_map) 2025-05-07T20:33:01.2104898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2104995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2105071Z E ^ 2025-05-07T20:33:01.2105431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2105484Z 2025-05-07T20:33:01.2105905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2105914Z 2025-05-07T20:33:01.2106022Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2106240Z self=, 2025-05-07T20:33:01.2106317Z T=128, 2025-05-07T20:33:01.2106401Z D=7168, 2025-05-07T20:33:01.2106481Z scale_ub=1200.0, 2025-05-07T20:33:01.2106566Z contiguous=True, 2025-05-07T20:33:01.2106653Z compiled=False, 2025-05-07T20:33:01.2106726Z ) 2025-05-07T20:33:01.2106942Z self = 2025-05-07T20:33:01.2107117Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.2107122Z 2025-05-07T20:33:01.2107199Z @given( 2025-05-07T20:33:01.2107325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2107534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2107660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2107788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2107900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2107970Z ) 2025-05-07T20:33:01.2108220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2108309Z def test_silu_mul_quant( 2025-05-07T20:33:01.2108382Z self, 2025-05-07T20:33:01.2108466Z T: int, 2025-05-07T20:33:01.2108543Z D: int, 2025-05-07T20:33:01.2108646Z scale_ub: Optional[float], 2025-05-07T20:33:01.2108733Z contiguous: bool, 2025-05-07T20:33:01.2108817Z compiled: bool, 2025-05-07T20:33:01.2108901Z ) -> None: 2025-05-07T20:33:01.2108993Z torch.manual_seed(2025) 2025-05-07T20:33:01.2109066Z 2025-05-07T20:33:01.2109238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2109309Z 2025-05-07T20:33:01.2109458Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2109594Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2111354Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2111360Z 2025-05-07T20:33:01.2111483Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.2111490Z 2025-05-07T20:33:01.2111590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2111816Z self=, 2025-05-07T20:33:01.2111895Z T=128, 2025-05-07T20:33:01.2112015Z D=5120, 2025-05-07T20:33:01.2112137Z scale_ub=1200.0, 2025-05-07T20:33:01.2112223Z contiguous=True, 2025-05-07T20:33:01.2112302Z compiled=True, 2025-05-07T20:33:01.2112378Z ) 2025-05-07T20:33:01.2112592Z self = 2025-05-07T20:33:01.2112755Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.2112759Z 2025-05-07T20:33:01.2112839Z @given( 2025-05-07T20:33:01.2112953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2113053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2113164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2113277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2113437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2113511Z ) 2025-05-07T20:33:01.2113754Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2113854Z def test_silu_mul_quant( 2025-05-07T20:33:01.2113929Z self, 2025-05-07T20:33:01.2114003Z T: int, 2025-05-07T20:33:01.2114084Z D: int, 2025-05-07T20:33:01.2114180Z scale_ub: Optional[float], 2025-05-07T20:33:01.2114267Z contiguous: bool, 2025-05-07T20:33:01.2114356Z compiled: bool, 2025-05-07T20:33:01.2114431Z ) -> None: 2025-05-07T20:33:01.2114527Z torch.manual_seed(2025) 2025-05-07T20:33:01.2114596Z 2025-05-07T20:33:01.2114758Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2114845Z 2025-05-07T20:33:01.2114934Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2115057Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2116817Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
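By this point the free pool has shrunk from 26.44 MiB to 4.44 MiB: tensors cached from earlier examples are still held when the next example draws its input, so even the 20.00 MiB temporary for torch.clamp fails. Assuming unittest setUp/tearDown run around each generated Hypothesis example, one possible per-example cleanup looks like the sketch below (illustrative only, not code from the test file):

import gc
import unittest
import torch

class ActivationTests(unittest.TestCase):  # illustrative placement only
    def tearDown(self) -> None:
        gc.collect()              # drop dead Python references to tensors first
        torch.cuda.empty_cache()  # then return cached CUDA blocks to the driver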
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2116825Z 2025-05-07T20:33:01.2116940Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.2116944Z 2025-05-07T20:33:01.2117046Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2117266Z self=, 2025-05-07T20:33:01.2117342Z T=128, 2025-05-07T20:33:01.2117421Z D=7168, 2025-05-07T20:33:01.2117504Z scale_ub=None, 2025-05-07T20:33:01.2117594Z contiguous=True, 2025-05-07T20:33:01.2117676Z compiled=True, 2025-05-07T20:33:01.2117745Z ) 2025-05-07T20:33:01.2118008Z self = 2025-05-07T20:33:01.2118174Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.2118179Z 2025-05-07T20:33:01.2118250Z @given( 2025-05-07T20:33:01.2118369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2118468Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2118584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2118704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2118816Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2118890Z ) 2025-05-07T20:33:01.2119131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2119219Z def test_silu_mul_quant( 2025-05-07T20:33:01.2119302Z self, 2025-05-07T20:33:01.2119374Z T: int, 2025-05-07T20:33:01.2119451Z D: int, 2025-05-07T20:33:01.2119561Z scale_ub: Optional[float], 2025-05-07T20:33:01.2119646Z contiguous: bool, 2025-05-07T20:33:01.2119810Z compiled: bool, 2025-05-07T20:33:01.2119896Z ) -> None: 2025-05-07T20:33:01.2119992Z torch.manual_seed(2025) 2025-05-07T20:33:01.2120065Z 2025-05-07T20:33:01.2120234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2122047Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2122091Z 2025-05-07T20:33:01.2122217Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2122353Z =============================== warnings summary =============================== 2025-05-07T20:33:01.2122665Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:01.2122963Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:01.2123256Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:01.2124133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:01.2124366Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:01.2124371Z 2025-05-07T20:33:01.2124586Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:01.2124759Z ================= 1 failed, 1 deselected, 3 warnings in 13.91s ================= 2025-05-07T20:33:02.9396373Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:03.0018332Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:33:03.0018737Z 2025-05-07T20:33:05.0034444Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:07.1654302Z ============================= test session starts ============================== 2025-05-07T20:33:07.1654975Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:07.1655758Z cachedir: .pytest_cache 2025-05-07T20:33:07.1656344Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:07.1657072Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:07.1657475Z plugins: hypothesis-6.131.14 2025-05-07T20:33:08.7299880Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:08.8269444Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:08.8269851Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:08.8270065Z 2025-05-07T20:33:10.9537239Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.9537951Z self=, 2025-05-07T20:33:10.9538396Z T=1, 2025-05-07T20:33:10.9538618Z D=5120, 2025-05-07T20:33:10.9538900Z scale_ub=None, 2025-05-07T20:33:10.9539188Z contiguous=True, 2025-05-07T20:33:10.9539885Z compiled=True, 2025-05-07T20:33:10.9540351Z ) 2025-05-07T20:33:10.9540691Z self = 2025-05-07T20:33:10.9541183Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.9541447Z 2025-05-07T20:33:10.9541523Z @given( 2025-05-07T20:33:10.9541754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.9542062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.9542354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.9542682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.9543002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.9543396Z ) 2025-05-07T20:33:10.9543740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.9544223Z def test_silu_mul_quant( 2025-05-07T20:33:10.9544485Z self, 2025-05-07T20:33:10.9544668Z T: int, 2025-05-07T20:33:10.9551750Z D: int, 2025-05-07T20:33:10.9551982Z scale_ub: Optional[float], 2025-05-07T20:33:10.9552248Z contiguous: bool, 2025-05-07T20:33:10.9552493Z compiled: bool, 2025-05-07T20:33:10.9552723Z ) -> None: 2025-05-07T20:33:10.9552931Z torch.manual_seed(2025) 2025-05-07T20:33:10.9553173Z 2025-05-07T20:33:10.9553444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.9553792Z 2025-05-07T20:33:10.9553994Z x_sign = torch.sign(x) 2025-05-07T20:33:10.9554314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:10.9554625Z x = x_sign * x_clamp 2025-05-07T20:33:10.9554855Z x0 = x[:, :D] 2025-05-07T20:33:10.9555070Z x1 = x[:, D:] 2025-05-07T20:33:10.9555278Z 2025-05-07T20:33:10.9555457Z if contiguous: 2025-05-07T20:33:10.9555689Z x0 = x0.contiguous() 2025-05-07T20:33:10.9555943Z x1 = x1.contiguous() 2025-05-07T20:33:10.9556174Z 2025-05-07T20:33:10.9556357Z if scale_ub is not None: 2025-05-07T20:33:10.9556624Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.9556954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.9557259Z ) 2025-05-07T20:33:10.9557449Z else: 2025-05-07T20:33:10.9557650Z scale_ub_tensor = None 2025-05-07T20:33:10.9557894Z 2025-05-07T20:33:10.9558123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.9558426Z op = silu_mul_quant 2025-05-07T20:33:10.9558663Z if compiled: 2025-05-07T20:33:10.9558903Z op = torch.compile(op) 2025-05-07T20:33:10.9559195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.9559459Z 2025-05-07T20:33:10.9559647Z y_fp8, y_scale = fn() 2025-05-07T20:33:10.9559925Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:10.9560331Z 2025-05-07T20:33:10.9560576Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.9560899Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:10.9561178Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:10.9561482Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:10.9561837Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.9562141Z 2025-05-07T20:33:10.9562329Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:10.9562525Z 2025-05-07T20:33:10.9562623Z moe/activation_test.py:126: 2025-05-07T20:33:10.9562914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9563237Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:10.9563564Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.9564359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:10.9565254Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:10.9565798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.9566464Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.9567158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:10.9567862Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:10.9568585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:10.9569212Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:10.9570785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:10.9571296Z fn() 2025-05-07T20:33:10.9571812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:10.9572404Z self.fn.run( 2025-05-07T20:33:10.9572862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.9573382Z kernel = self.compile( 2025-05-07T20:33:10.9573927Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.9574598Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.9574982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9575212Z 2025-05-07T20:33:10.9575418Z self = 2025-05-07T20:33:10.9576487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.9577924Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38ffeae700>} 2025-05-07T20:33:10.9579281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.9580330Z context = 2025-05-07T20:33:10.9580609Z 2025-05-07T20:33:10.9580771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.9581286Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.9581800Z module_map=module_map) 2025-05-07T20:33:10.9582163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.9582515Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:10.9582777Z E ^ 2025-05-07T20:33:10.9583228Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.9583691Z 2025-05-07T20:33:10.9584115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.9584622Z 2025-05-07T20:33:10.9584721Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.9585125Z self=, 2025-05-07T20:33:10.9585527Z T=2048, 2025-05-07T20:33:10.9585710Z D=5120, 2025-05-07T20:33:10.9585901Z scale_ub=1200.0, 2025-05-07T20:33:10.9586119Z contiguous=True, 2025-05-07T20:33:10.9586327Z compiled=False, 2025-05-07T20:33:10.9586525Z ) 2025-05-07T20:33:10.9586844Z self = 2025-05-07T20:33:10.9587518Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.9587792Z 2025-05-07T20:33:10.9587866Z @given( 2025-05-07T20:33:10.9588088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.9588386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.9588687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.9589006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.9589318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.9589595Z ) 2025-05-07T20:33:10.9589948Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.9590426Z def test_silu_mul_quant( 2025-05-07T20:33:10.9590663Z self, 2025-05-07T20:33:10.9590855Z T: int, 2025-05-07T20:33:10.9591046Z D: int, 2025-05-07T20:33:10.9591257Z scale_ub: Optional[float], 2025-05-07T20:33:10.9591526Z contiguous: bool, 2025-05-07T20:33:10.9591762Z compiled: bool, 2025-05-07T20:33:10.9591972Z ) -> None: 2025-05-07T20:33:10.9592180Z torch.manual_seed(2025) 2025-05-07T20:33:10.9592421Z 2025-05-07T20:33:10.9592682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.9593015Z 2025-05-07T20:33:10.9593192Z x_sign = torch.sign(x) 2025-05-07T20:33:10.9593468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.9593769Z x = x_sign * x_clamp 2025-05-07T20:33:10.9594005Z x0 = x[:, :D] 
2025-05-07T20:33:10.9594210Z x1 = x[:, D:] 2025-05-07T20:33:10.9594413Z 2025-05-07T20:33:10.9594594Z if contiguous: 2025-05-07T20:33:10.9594815Z x0 = x0.contiguous() 2025-05-07T20:33:10.9595072Z x1 = x1.contiguous() 2025-05-07T20:33:10.9595310Z 2025-05-07T20:33:10.9595490Z if scale_ub is not None: 2025-05-07T20:33:10.9595751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.9596075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.9596376Z ) 2025-05-07T20:33:10.9596552Z else: 2025-05-07T20:33:10.9596748Z scale_ub_tensor = None 2025-05-07T20:33:10.9596990Z 2025-05-07T20:33:10.9597208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.9597509Z op = silu_mul_quant 2025-05-07T20:33:10.9597750Z if compiled: 2025-05-07T20:33:10.9597985Z op = torch.compile(op) 2025-05-07T20:33:10.9598274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.9598542Z 2025-05-07T20:33:10.9598725Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.9598890Z 2025-05-07T20:33:10.9598987Z moe/activation_test.py:117: 2025-05-07T20:33:10.9599281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9599659Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.9599930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.9600615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.9601293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.9601836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.9602503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.9603155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.9603674Z kernel = self.compile( 2025-05-07T20:33:10.9604219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.9604885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.9605377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9605640Z 2025-05-07T20:33:10.9605848Z self = 2025-05-07T20:33:10.9606904Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.9608251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38ffd5e020>} 2025-05-07T20:33:10.9609568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.9610668Z context = 2025-05-07T20:33:10.9610952Z 2025-05-07T20:33:10.9611116Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.9611630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.9612104Z module_map=module_map) 2025-05-07T20:33:10.9612462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.9612801Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.9613052Z E ^ 2025-05-07T20:33:10.9613516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.9613960Z 2025-05-07T20:33:10.9614431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.6219381Z 2025-05-07T20:33:11.6220250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.6221146Z self=, 2025-05-07T20:33:11.6221968Z T=2048, 2025-05-07T20:33:11.6222326Z D=5120, 2025-05-07T20:33:11.6222689Z scale_ub=1200.0, 2025-05-07T20:33:11.6223107Z contiguous=True, 2025-05-07T20:33:11.6223531Z compiled=True, 2025-05-07T20:33:11.6223922Z ) 2025-05-07T20:33:11.6224359Z self = 2025-05-07T20:33:11.6224882Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.6225163Z 2025-05-07T20:33:11.6225244Z @given( 2025-05-07T20:33:11.6225467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.6225769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.6226074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.6226404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.6226719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.6227287Z ) 2025-05-07T20:33:11.6227692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.6228131Z def test_silu_mul_quant( 2025-05-07T20:33:11.6228372Z self, 2025-05-07T20:33:11.6228602Z T: int, 2025-05-07T20:33:11.6228796Z D: int, 2025-05-07T20:33:11.6229014Z scale_ub: Optional[float], 2025-05-07T20:33:11.6229269Z contiguous: bool, 2025-05-07T20:33:11.6229501Z compiled: bool, 2025-05-07T20:33:11.6229738Z ) -> None: 2025-05-07T20:33:11.6229945Z torch.manual_seed(2025) 2025-05-07T20:33:11.6230175Z 2025-05-07T20:33:11.6230443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.6230786Z 2025-05-07T20:33:11.6230969Z x_sign = torch.sign(x) 2025-05-07T20:33:11.6231261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.6231572Z x = x_sign * x_clamp 2025-05-07T20:33:11.6231797Z x0 = x[:, :D] 2025-05-07T20:33:11.6232012Z x1 = x[:, D:] 2025-05-07T20:33:11.6232302Z 2025-05-07T20:33:11.6232554Z if contiguous: 2025-05-07T20:33:11.6232778Z x0 = x0.contiguous() 2025-05-07T20:33:11.6233033Z x1 = x1.contiguous() 2025-05-07T20:33:11.6233269Z 2025-05-07T20:33:11.6233452Z if scale_ub is not None: 2025-05-07T20:33:11.6233722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.6234050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.6234358Z ) 2025-05-07T20:33:11.6234550Z else: 2025-05-07T20:33:11.6234755Z scale_ub_tensor = None 2025-05-07T20:33:11.6234992Z 2025-05-07T20:33:11.6235216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.6235517Z op = silu_mul_quant 2025-05-07T20:33:11.6235844Z if compiled: 2025-05-07T20:33:11.6236084Z op = torch.compile(op) 2025-05-07T20:33:11.6236378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.6236640Z 2025-05-07T20:33:11.6236828Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.6237102Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.6237383Z 2025-05-07T20:33:11.6237606Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.6237930Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.6238216Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.6238518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.6238867Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.6239168Z 2025-05-07T20:33:11.6239355Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.6239552Z 2025-05-07T20:33:11.6239649Z moe/activation_test.py:126: 2025-05-07T20:33:11.6239943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6240534Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.6240857Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.6241678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.6242415Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.6242961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.6243654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.6244395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.6245107Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.6245823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.6246536Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.6247149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.6247657Z fn() 2025-05-07T20:33:11.6248156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.6248732Z self.fn.run( 2025-05-07T20:33:11.6249195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.6249709Z kernel = self.compile( 2025-05-07T20:33:11.6250247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.6250888Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.6251276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6251503Z 2025-05-07T20:33:11.6251715Z self = 2025-05-07T20:33:11.6252968Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.6254389Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fec3e200>} 2025-05-07T20:33:11.6255703Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.6256761Z context = 2025-05-07T20:33:11.6257114Z 2025-05-07T20:33:11.6257277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.6257805Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.6258267Z module_map=module_map) 2025-05-07T20:33:11.6258623Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.6258975Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.6259231Z E ^ 2025-05-07T20:33:11.6259676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.6260128Z 2025-05-07T20:33:11.6260559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.6261067Z 2025-05-07T20:33:11.6261168Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.6261578Z self=, 2025-05-07T20:33:11.6261974Z T=16384, 2025-05-07T20:33:11.6262159Z D=7168, 2025-05-07T20:33:11.6262345Z scale_ub=1200.0, 2025-05-07T20:33:11.6262563Z contiguous=False, 2025-05-07T20:33:11.6262786Z compiled=False, 2025-05-07T20:33:11.6262983Z ) 2025-05-07T20:33:11.6263284Z self = 2025-05-07T20:33:11.6263775Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.6264062Z 2025-05-07T20:33:11.6264135Z @given( 2025-05-07T20:33:11.6264358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.6264658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.6264956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.6265278Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.6265591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.6265870Z ) 2025-05-07T20:33:11.6266217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.6266696Z def test_silu_mul_quant( 2025-05-07T20:33:11.6266939Z self, 2025-05-07T20:33:11.6267124Z T: int, 2025-05-07T20:33:11.6267306Z D: int, 2025-05-07T20:33:11.6267572Z scale_ub: Optional[float], 2025-05-07T20:33:11.6267841Z contiguous: bool, 2025-05-07T20:33:11.6268074Z compiled: bool, 2025-05-07T20:33:11.6268284Z ) -> None: 2025-05-07T20:33:11.6268495Z torch.manual_seed(2025) 2025-05-07T20:33:11.6268732Z 2025-05-07T20:33:11.6268992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.6269329Z 2025-05-07T20:33:11.6269516Z x_sign = torch.sign(x) 2025-05-07T20:33:11.6269797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.6270098Z x = x_sign * x_clamp 2025-05-07T20:33:11.6270336Z x0 = x[:, :D] 2025-05-07T20:33:11.6270540Z x1 = x[:, D:] 2025-05-07T20:33:11.6270743Z 2025-05-07T20:33:11.6270920Z if contiguous: 2025-05-07T20:33:11.6271139Z x0 = x0.contiguous() 2025-05-07T20:33:11.6271479Z x1 = x1.contiguous() 2025-05-07T20:33:11.6271716Z 2025-05-07T20:33:11.6271900Z if scale_ub is not None: 2025-05-07T20:33:11.6272170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.6272503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.6272808Z ) 2025-05-07T20:33:11.6272988Z else: 2025-05-07T20:33:11.6273188Z scale_ub_tensor = None 2025-05-07T20:33:11.6273434Z 2025-05-07T20:33:11.6273652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.6273961Z op = silu_mul_quant 2025-05-07T20:33:11.6274203Z if compiled: 2025-05-07T20:33:11.6274438Z op = torch.compile(op) 2025-05-07T20:33:11.6274774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.6275036Z 2025-05-07T20:33:11.6275214Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.6275388Z 2025-05-07T20:33:11.6275484Z moe/activation_test.py:117: 2025-05-07T20:33:11.6275783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6276095Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.6276370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.6277046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:11.6277721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.6278246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.6278912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.6279566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.6280091Z kernel = self.compile( 2025-05-07T20:33:11.6280631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.6281275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.6281666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6281886Z 2025-05-07T20:33:11.6282091Z self = 2025-05-07T20:33:11.6283149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.6284550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fee484a0>} 2025-05-07T20:33:11.6285922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.6286988Z context = 2025-05-07T20:33:11.6287268Z 2025-05-07T20:33:11.6287430Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.6287947Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.6288423Z module_map=module_map) 2025-05-07T20:33:11.6288778Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.6289140Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.6289401Z E ^ 2025-05-07T20:33:11.6289863Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fef04ea0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
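Note on the root cause: fp8e4nv is Triton's name for the FP8 E4M3 format these FBGEMM kernels request. Triton only lowers it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G, which reports capability 8.6, so every launch of _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row fails at kernel-compile time. A minimal probe one could use to detect this ahead of time (the helper name is ours, not part of fbgemm_gpu; the sm_89 cutoff is the assumed rule):

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv") lowering in Triton is assumed to need
        # compute capability >= 8.9 (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        # An A10G (g5 instance) reports (8, 6) and returns False here.
        return torch.cuda.get_device_capability() >= (8, 9)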
Every remaining example fails the same way: the first Triton kernel launch in the example hits the fp8e4nv CompilationError, either in fn() at moe/activation_test.py:117 (silu_mul_quant -> _fbgemm_silu_mul_quant, via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) or in ref_fn() at moe/activation_test.py:126 (triton_quantize_fp8_row -> _kernel_quantize_fp8_row, via fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370). The test body and tracebacks are identical to the one above; only the drawn parameters and the failing call differ.

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...)
    E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...): same CompilationError (fp8e4nv not supported).
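For readers following the math rather than the stack traces: each example computes y = silu(x0) * x1 = x0 * sigmoid(x0) * x1 in fp32 and then row-wise fp8-quantizes it, with dequantization defined in the test as y_fp8.to(torch.float32) * y_scale[:, None]. A torch-only sketch of that quantization step, inferred from the test body (the exact rounding and saturation behavior of triton_quantize_fp8_row is an assumption here):

    from typing import Optional, Tuple
    import torch

    def rowwise_fp8_quant_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row multiplicative dequant scale: y ~= y_fp8.float() * y_scale[:, None]
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Optional upper bound on the row maximum, as in the scale_ub=1200.0 examples.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale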
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...): same CompilationError (fp8e4nv not supported).
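On the Hypothesis side: @given with st.sampled_from draws examples from the 5 x 2 x 2 x 2 x 2 = 80-point grid of (T, D, scale_ub, contiguous, compiled); verbosity=Verbosity.verbose is what prints each "Trying example" block, max_examples=_MAX_SAMPLES caps the number of draws, and deadline=None disables the per-example time limit (relevant when the first call triggers a Triton compile). A self-contained sketch of the same pattern; _MAX_SAMPLES is internal to the test module, so a literal stands in:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def check_grid(T: int, D: int) -> None:
        # A property that holds at every point of this sampled grid.
        assert T >= 1 and D % 256 == 0

    check_grid()  # calling the decorated function runs the property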
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).
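The alternation in the failing call site is worth noting: every compiled=False example dies inside fn() (the eager launch of _fbgemm_silu_mul_quant), while every compiled=True example gets past fn() and dies in the eager reference ref_fn() instead, which suggests the torch.compile path does not launch the handwritten kernel the same way the eager path does. Either way, every example on this GPU ends in the same architecture error. A standalone reproduction outside Hypothesis, under the same assumptions as the test body (import path and signature taken from the traceback):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x0 = torch.randn([T, D], device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn([T, D], device="cuda", dtype=torch.bfloat16)
    # On an sm_86 GPU this raises triton.compiler.errors.CompilationError
    # ("type fp8e4nv not supported in this architecture").
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)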
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).
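Since the failure is a property of the runner's GPU rather than of the kernels' logic, one hedged way to keep this suite meaningful on pre-sm_89 machines would be a capability-based skip, sketched below with unittest (the class name is illustrative, and whether skipping is the right policy for this CI job is a separate decision):

    import unittest
    import torch

    def _has_fp8_gpu() -> bool:
        # Mirrors the capability probe above: fp8e4nv assumed to need sm_89+.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _has_fp8_gpu(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class SiluMulQuantFp8Test(unittest.TestCase):
        ...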
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7919299Z 2025-05-07T20:33:14.7919726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.7920225Z 2025-05-07T20:33:14.7920332Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.7920728Z self=, 2025-05-07T20:33:14.7921112Z T=16384, 2025-05-07T20:33:14.7921297Z D=5120, 2025-05-07T20:33:14.7921477Z scale_ub=None, 2025-05-07T20:33:14.7921729Z contiguous=True, 2025-05-07T20:33:14.7921942Z compiled=True, 2025-05-07T20:33:14.7922127Z ) 2025-05-07T20:33:14.7922435Z self = 2025-05-07T20:33:14.7922916Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:14.7923177Z 2025-05-07T20:33:14.7923248Z @given( 2025-05-07T20:33:14.7923470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.7923771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.7924063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.7924373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.7924688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.7924957Z ) 2025-05-07T20:33:14.7925292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.7925722Z def test_silu_mul_quant( 2025-05-07T20:33:14.7925950Z self, 2025-05-07T20:33:14.7926131Z T: int, 2025-05-07T20:33:14.7926317Z D: int, 2025-05-07T20:33:14.7926525Z scale_ub: Optional[float], 2025-05-07T20:33:14.7926787Z contiguous: bool, 2025-05-07T20:33:14.7927022Z compiled: bool, 2025-05-07T20:33:14.7927234Z ) -> None: 2025-05-07T20:33:14.7927437Z torch.manual_seed(2025) 2025-05-07T20:33:14.7927669Z 2025-05-07T20:33:14.7927936Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.7928268Z 2025-05-07T20:33:14.7928451Z x_sign = torch.sign(x) 2025-05-07T20:33:14.7928733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.7929211Z x = x_sign * x_clamp 2025-05-07T20:33:14.7929441Z x0 = x[:, :D] 2025-05-07T20:33:14.7929647Z x1 = x[:, D:] 2025-05-07T20:33:14.7929851Z 2025-05-07T20:33:14.7930024Z if contiguous: 2025-05-07T20:33:14.7930244Z x0 = x0.contiguous() 2025-05-07T20:33:14.7930491Z x1 = x1.contiguous() 2025-05-07T20:33:14.7930716Z 2025-05-07T20:33:14.7930899Z if scale_ub is not None: 2025-05-07T20:33:14.7931242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.7931585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.7931890Z ) 2025-05-07T20:33:14.7932082Z else: 2025-05-07T20:33:14.7932290Z scale_ub_tensor = None 2025-05-07T20:33:14.7932535Z 2025-05-07T20:33:14.7932761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7933061Z op = silu_mul_quant 2025-05-07T20:33:14.7933300Z if compiled: 2025-05-07T20:33:14.7933544Z op = torch.compile(op) 2025-05-07T20:33:14.7933862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7934245Z 2025-05-07T20:33:14.7934496Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.7934888Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.7935243Z 2025-05-07T20:33:14.7935470Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7935799Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.7936091Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.7936491Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.7936844Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.7937148Z 2025-05-07T20:33:14.7937337Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:14.7937530Z 2025-05-07T20:33:14.7937627Z moe/activation_test.py:126: 2025-05-07T20:33:14.7937920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7938249Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.7938561Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.7939336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.7940282Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.7940820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.7941494Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.7942172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.7942878Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.7943613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.7944248Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.7944873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.7945424Z fn() 2025-05-07T20:33:14.7945921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.7946502Z self.fn.run( 2025-05-07T20:33:14.7946965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.7947546Z kernel = self.compile( 2025-05-07T20:33:14.7948082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.7948720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.7949109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7949330Z 2025-05-07T20:33:14.7949535Z self = 2025-05-07T20:33:14.7950594Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.7952031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8d38680>} 2025-05-07T20:33:14.7953785Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.7954856Z context = 2025-05-07T20:33:14.7955138Z 2025-05-07T20:33:14.7955300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.7955810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.7956268Z module_map=module_map) 2025-05-07T20:33:14.7956627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.7956980Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.7957241Z E ^ 2025-05-07T20:33:14.7957791Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7958302Z 2025-05-07T20:33:14.7958719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.8157923Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:14.8160582Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:14.8162961Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:14.8164947Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:14.8166127Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:33:15.2257511Z 2025-05-07T20:33:15.2257708Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.2258339Z self=, 2025-05-07T20:33:15.2258920Z T=1, 2025-05-07T20:33:15.2259172Z D=5120, 2025-05-07T20:33:15.2259437Z scale_ub=1200.0, 2025-05-07T20:33:15.2259741Z contiguous=True, 2025-05-07T20:33:15.2260042Z compiled=True, 2025-05-07T20:33:15.2260312Z ) 2025-05-07T20:33:15.2260694Z self = 2025-05-07T20:33:15.2261192Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:15.2261450Z 2025-05-07T20:33:15.2261533Z @given( 2025-05-07T20:33:15.2261768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.2262079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.2262374Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.2262686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.2262996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.2263263Z ) 2025-05-07T20:33:15.2263591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.2264030Z def test_silu_mul_quant( 2025-05-07T20:33:15.2264258Z self, 2025-05-07T20:33:15.2264440Z T: int, 2025-05-07T20:33:15.2264634Z D: int, 2025-05-07T20:33:15.2264854Z scale_ub: Optional[float], 2025-05-07T20:33:15.2265115Z contiguous: bool, 2025-05-07T20:33:15.2265354Z compiled: bool, 2025-05-07T20:33:15.2265585Z ) -> None: 2025-05-07T20:33:15.2265916Z torch.manual_seed(2025) 2025-05-07T20:33:15.2266169Z 2025-05-07T20:33:15.2266449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.2266791Z 2025-05-07T20:33:15.2266977Z x_sign = torch.sign(x) 2025-05-07T20:33:15.2267264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.2267664Z x = x_sign * x_clamp 2025-05-07T20:33:15.2267893Z x0 = x[:, :D] 2025-05-07T20:33:15.2268104Z x1 = x[:, D:] 2025-05-07T20:33:15.2268310Z 2025-05-07T20:33:15.2268482Z if contiguous: 2025-05-07T20:33:15.2268712Z x0 = x0.contiguous() 2025-05-07T20:33:15.2268968Z x1 = x1.contiguous() 2025-05-07T20:33:15.2269198Z 2025-05-07T20:33:15.2269379Z if scale_ub is not None: 2025-05-07T20:33:15.2269648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.2269965Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:15.2270269Z ) 2025-05-07T20:33:15.2270461Z else: 2025-05-07T20:33:15.2270778Z scale_ub_tensor = None 2025-05-07T20:33:15.2271022Z 2025-05-07T20:33:15.2271245Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.2271547Z op = silu_mul_quant 2025-05-07T20:33:15.2271781Z if compiled: 2025-05-07T20:33:15.2272020Z op = torch.compile(op) 2025-05-07T20:33:15.2272311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.2272567Z 2025-05-07T20:33:15.2272749Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.2272906Z 2025-05-07T20:33:15.2273004Z moe/activation_test.py:117: 2025-05-07T20:33:15.2273282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2273605Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.2273953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.2274494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.2275066Z return fn(*args, **kwargs) 2025-05-07T20:33:15.2275737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.2276402Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.2276920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.2277586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.2278234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.2278747Z kernel = self.compile( 2025-05-07T20:33:15.2279287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.2279930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.2280319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2280545Z 2025-05-07T20:33:15.2280744Z self = 2025-05-07T20:33:15.2281802Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.2283152Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8934680>} 2025-05-07T20:33:15.2284465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.2285546Z context = 2025-05-07T20:33:15.2285828Z 2025-05-07T20:33:15.2285994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.2286505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.2286965Z module_map=module_map) 2025-05-07T20:33:15.2287322Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.2287667Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.2287917Z E ^ 2025-05-07T20:33:15.2288368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.2288804Z 2025-05-07T20:33:15.2289213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.2289722Z 2025-05-07T20:33:15.2289818Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.2290223Z self=, 2025-05-07T20:33:15.2290664Z T=1, 2025-05-07T20:33:15.2290882Z D=5120, 2025-05-07T20:33:15.2291072Z scale_ub=None, 2025-05-07T20:33:15.2291286Z contiguous=False, 2025-05-07T20:33:15.2291499Z compiled=True, 2025-05-07T20:33:15.2291699Z ) 2025-05-07T20:33:15.2292004Z self = 2025-05-07T20:33:15.2292467Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.2292731Z 2025-05-07T20:33:15.2292806Z @given( 2025-05-07T20:33:15.2293025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.2293319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.2293617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.2294008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.2294326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.2294592Z ) 2025-05-07T20:33:15.2294941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.2295380Z def test_silu_mul_quant( 2025-05-07T20:33:15.2295610Z self, 2025-05-07T20:33:15.2295801Z T: int, 2025-05-07T20:33:15.2295995Z D: int, 2025-05-07T20:33:15.2296210Z scale_ub: Optional[float], 2025-05-07T20:33:15.2296475Z contiguous: bool, 2025-05-07T20:33:15.2296712Z compiled: bool, 2025-05-07T20:33:15.2296920Z ) -> None: 2025-05-07T20:33:15.2297125Z torch.manual_seed(2025) 2025-05-07T20:33:15.2297382Z 2025-05-07T20:33:15.2297650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.2297978Z 2025-05-07T20:33:15.2298163Z x_sign = torch.sign(x) 2025-05-07T20:33:15.2298454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.2298753Z x = x_sign * x_clamp 2025-05-07T20:33:15.2298994Z x0 = x[:, :D] 2025-05-07T20:33:15.2299208Z x1 = x[:, D:] 2025-05-07T20:33:15.2299413Z 2025-05-07T20:33:15.2299593Z if contiguous: 2025-05-07T20:33:15.2299814Z x0 = x0.contiguous() 2025-05-07T20:33:15.2305671Z x1 = x1.contiguous() 2025-05-07T20:33:15.2305942Z 2025-05-07T20:33:15.2306134Z if scale_ub is not None: 2025-05-07T20:33:15.2306409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.2306736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.2307038Z ) 2025-05-07T20:33:15.2307227Z else: 2025-05-07T20:33:15.2307482Z scale_ub_tensor = None 2025-05-07T20:33:15.2307731Z 2025-05-07T20:33:15.2307957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.2308262Z op = silu_mul_quant 2025-05-07T20:33:15.2308513Z if compiled: 2025-05-07T20:33:15.2308749Z op = torch.compile(op) 2025-05-07T20:33:15.2309034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.2309375Z 2025-05-07T20:33:15.2309565Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.2309847Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.2310126Z 2025-05-07T20:33:15.2310356Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.2310680Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.2310963Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.2311270Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.2311617Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.2311910Z 2025-05-07T20:33:15.2312100Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.2312294Z 2025-05-07T20:33:15.2312389Z moe/activation_test.py:126: 2025-05-07T20:33:15.2312683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2313000Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.2313322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.2314194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.2314919Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.2315460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.2316132Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.2316818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.2317522Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.2318285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.2318904Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.2319494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.2319995Z fn() 2025-05-07T20:33:15.2320489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.2321052Z self.fn.run( 2025-05-07T20:33:15.2321503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.2322018Z kernel = self.compile( 2025-05-07T20:33:15.2322563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.2323194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.2323572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2323794Z 2025-05-07T20:33:15.2323995Z self = 2025-05-07T20:33:15.2325053Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.2326392Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d892ad40>} 2025-05-07T20:33:15.2327694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.2328691Z context = 2025-05-07T20:33:15.2328980Z 2025-05-07T20:33:15.2329139Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.2329693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.2330151Z module_map=module_map) 2025-05-07T20:33:15.2330504Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.2330853Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.2331102Z E ^ 2025-05-07T20:33:15.2331553Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.2331991Z 2025-05-07T20:33:15.2332398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.3756715Z 2025-05-07T20:33:15.3757041Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.3757454Z self=, 2025-05-07T20:33:15.3757919Z T=1, 2025-05-07T20:33:15.3758186Z D=5120, 2025-05-07T20:33:15.3758447Z scale_ub=None, 2025-05-07T20:33:15.3758725Z contiguous=True, 2025-05-07T20:33:15.3759222Z compiled=False, 2025-05-07T20:33:15.3759432Z ) 2025-05-07T20:33:15.3759738Z self = 2025-05-07T20:33:15.3760212Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:15.3760474Z 2025-05-07T20:33:15.3760562Z @given( 2025-05-07T20:33:15.3760786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.3761086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.3761386Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.3761717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.3762039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.3762397Z ) 2025-05-07T20:33:15.3762781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.3763239Z def test_silu_mul_quant( 2025-05-07T20:33:15.3763478Z self, 2025-05-07T20:33:15.3763665Z T: int, 2025-05-07T20:33:15.3763875Z D: int, 2025-05-07T20:33:15.3764088Z scale_ub: Optional[float], 2025-05-07T20:33:15.3764347Z contiguous: bool, 2025-05-07T20:33:15.3764586Z compiled: bool, 2025-05-07T20:33:15.3764805Z ) -> None: 2025-05-07T20:33:15.3765018Z torch.manual_seed(2025) 2025-05-07T20:33:15.3765267Z 2025-05-07T20:33:15.3765541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.3765872Z 2025-05-07T20:33:15.3766067Z x_sign = torch.sign(x) 2025-05-07T20:33:15.3766353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.3766650Z x = x_sign * x_clamp 2025-05-07T20:33:15.3766890Z x0 = x[:, :D] 2025-05-07T20:33:15.3767110Z x1 = x[:, D:] 2025-05-07T20:33:15.3767323Z 2025-05-07T20:33:15.3767504Z if contiguous: 2025-05-07T20:33:15.3767740Z x0 = x0.contiguous() 2025-05-07T20:33:15.3768006Z x1 = x1.contiguous() 2025-05-07T20:33:15.3768245Z 2025-05-07T20:33:15.3768446Z if scale_ub is not None: 2025-05-07T20:33:15.3768723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.3769055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.3769368Z ) 2025-05-07T20:33:15.3769564Z else: 2025-05-07T20:33:15.3769772Z scale_ub_tensor = None 2025-05-07T20:33:15.3770025Z 2025-05-07T20:33:15.3770266Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.3770569Z op = silu_mul_quant 2025-05-07T20:33:15.3770825Z if compiled: 2025-05-07T20:33:15.3771073Z op = torch.compile(op) 2025-05-07T20:33:15.3771362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3771636Z 2025-05-07T20:33:15.3771826Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.3771990Z 2025-05-07T20:33:15.3772165Z moe/activation_test.py:117: 2025-05-07T20:33:15.3772454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3772786Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.3773065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3773736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.3774410Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.3774954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.3775639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.3776286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.3776816Z kernel = self.compile( 2025-05-07T20:33:15.3777366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.3778090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.3778480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3778712Z 2025-05-07T20:33:15.3778916Z self = 2025-05-07T20:33:15.3779970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.3781316Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9226660>} 2025-05-07T20:33:15.3782664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.3783674Z context = 2025-05-07T20:33:15.3783958Z 2025-05-07T20:33:15.3784131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.3784638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.3785122Z module_map=module_map) 2025-05-07T20:33:15.3785505Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.3785858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.3786107Z E ^ 2025-05-07T20:33:15.3786558Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.3787000Z 2025-05-07T20:33:15.3787522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.3788023Z 2025-05-07T20:33:15.3788130Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.3788530Z self=, 2025-05-07T20:33:15.3788928Z T=128, 2025-05-07T20:33:15.3789110Z D=5120, 2025-05-07T20:33:15.3789290Z scale_ub=None, 2025-05-07T20:33:15.3789499Z contiguous=False, 2025-05-07T20:33:15.3789715Z compiled=True, 2025-05-07T20:33:15.3789905Z ) 2025-05-07T20:33:15.3790221Z self = 2025-05-07T20:33:15.3790701Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.3790959Z 2025-05-07T20:33:15.3791038Z @given( 2025-05-07T20:33:15.3791254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.3791561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.3791856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.3792224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.3792547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.3792824Z ) 2025-05-07T20:33:15.3793157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.3793592Z def test_silu_mul_quant( 2025-05-07T20:33:15.3793823Z self, 2025-05-07T20:33:15.3794005Z T: int, 2025-05-07T20:33:15.3794198Z D: int, 2025-05-07T20:33:15.3794407Z scale_ub: Optional[float], 2025-05-07T20:33:15.3794674Z contiguous: bool, 2025-05-07T20:33:15.3794906Z compiled: bool, 2025-05-07T20:33:15.3795127Z ) -> None: 2025-05-07T20:33:15.3795336Z torch.manual_seed(2025) 2025-05-07T20:33:15.3795568Z 2025-05-07T20:33:15.3795834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.3796173Z 2025-05-07T20:33:15.3796354Z x_sign = torch.sign(x) 2025-05-07T20:33:15.3796644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.3796994Z x = x_sign * x_clamp 2025-05-07T20:33:15.3797289Z x0 = x[:, :D] 2025-05-07T20:33:15.3797501Z x1 = x[:, D:] 2025-05-07T20:33:15.3797703Z 2025-05-07T20:33:15.3797876Z if contiguous: 2025-05-07T20:33:15.3798105Z x0 = x0.contiguous() 2025-05-07T20:33:15.3798357Z x1 = x1.contiguous() 2025-05-07T20:33:15.3798582Z 2025-05-07T20:33:15.3798767Z if scale_ub is not None: 2025-05-07T20:33:15.3799030Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.3799351Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.3799647Z ) 2025-05-07T20:33:15.3799835Z else: 2025-05-07T20:33:15.3800043Z scale_ub_tensor = None 2025-05-07T20:33:15.3800337Z 2025-05-07T20:33:15.3800556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.3800858Z op = silu_mul_quant 2025-05-07T20:33:15.3801107Z if compiled: 2025-05-07T20:33:15.3801350Z op = torch.compile(op) 2025-05-07T20:33:15.3801638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3801906Z 2025-05-07T20:33:15.3802093Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.3802253Z 2025-05-07T20:33:15.3802349Z moe/activation_test.py:117: 2025-05-07T20:33:15.3802635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3802962Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.3803235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3803784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.3804326Z return fn(*args, **kwargs) 
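Every CompilationError above shares one root cause: Triton cannot lower the fp8e4nv dtype (torch.float8_e4m3fn) on this runner's GPU, which only exposes fp8e4b15 and fp8e5. fp8e4nv kernels generally require an SM 8.9+ part (Ada/Hopper), while the A10G on a g5.4xlarge is SM 8.6. A minimal sketch of a capability guard that would skip these cases instead of erroring, assuming the unittest/hypothesis structure visible in the traces (the helper name and test body below are illustrative, not from the source):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which needs an
        # SM 8.9+ GPU (Ada / Hopper). The A10G on this g5.4xlarge is SM 8.6,
        # which is why the kernels above fail to compile.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class Fp8GuardExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs an SM 8.9+ GPU")
        def test_fp8_cast(self) -> None:
            # Trivial fp8 round-trip, only attempted on supporting hardware.
            x = torch.zeros(4, device="cuda").to(torch.float8_e4m3fn)
            self.assertEqual(x.dtype, torch.float8_e4m3fn)

    if __name__ == "__main__":
        unittest.main()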
2025-05-07T20:33:15.3804983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.3805819Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.3806478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.3807148Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.3807792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.3808310Z kernel = self.compile( 2025-05-07T20:33:15.3808843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.3809484Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.3809872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3810103Z 2025-05-07T20:33:15.3810309Z self = 2025-05-07T20:33:15.3811428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.3812774Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d892bb00>} 2025-05-07T20:33:15.3814136Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.3815156Z context = 2025-05-07T20:33:15.3815467Z 2025-05-07T20:33:15.3815629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.3816143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.3816594Z module_map=module_map) 2025-05-07T20:33:15.3817038Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.3817492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.3817747Z E ^ 2025-05-07T20:33:15.3818198Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.3818642Z 2025-05-07T20:33:15.3819060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.3819560Z 2025-05-07T20:33:15.3819672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.3820077Z self=, 2025-05-07T20:33:15.3820465Z T=128, 2025-05-07T20:33:15.3820718Z D=7168, 2025-05-07T20:33:15.3820917Z scale_ub=1200.0, 2025-05-07T20:33:15.3821149Z contiguous=False, 2025-05-07T20:33:15.3821388Z compiled=False, 2025-05-07T20:33:15.5399794Z ) 2025-05-07T20:33:15.5400372Z self = 2025-05-07T20:33:15.5400936Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:15.5401214Z 2025-05-07T20:33:15.5401302Z @given( 2025-05-07T20:33:15.5401535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.5401862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.5402160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.5402483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.5402800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.5403079Z ) 2025-05-07T20:33:15.5403414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.5403868Z def test_silu_mul_quant( 2025-05-07T20:33:15.5404107Z self, 2025-05-07T20:33:15.5404295Z T: int, 2025-05-07T20:33:15.5404481Z D: int, 2025-05-07T20:33:15.5404689Z scale_ub: Optional[float], 2025-05-07T20:33:15.5404945Z contiguous: bool, 2025-05-07T20:33:15.5405181Z compiled: bool, 2025-05-07T20:33:15.5405398Z ) -> None: 2025-05-07T20:33:15.5405597Z torch.manual_seed(2025) 2025-05-07T20:33:15.5405832Z 2025-05-07T20:33:15.5406096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.5406428Z 2025-05-07T20:33:15.5406618Z x_sign = torch.sign(x) 2025-05-07T20:33:15.5406935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.5407235Z x = x_sign * x_clamp 2025-05-07T20:33:15.5407459Z x0 = x[:, :D] 2025-05-07T20:33:15.5407658Z x1 = x[:, D:] 2025-05-07T20:33:15.5407845Z 2025-05-07T20:33:15.5408014Z if contiguous: 2025-05-07T20:33:15.5408228Z x0 = x0.contiguous() 2025-05-07T20:33:15.5408472Z x1 = x1.contiguous() 2025-05-07T20:33:15.5408693Z 2025-05-07T20:33:15.5408989Z if scale_ub is not None: 2025-05-07T20:33:15.5409262Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.5409590Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.5409887Z ) 2025-05-07T20:33:15.5410064Z else: 2025-05-07T20:33:15.5410263Z scale_ub_tensor = None 2025-05-07T20:33:15.5410505Z 2025-05-07T20:33:15.5410722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.5411028Z op = silu_mul_quant 2025-05-07T20:33:15.5411268Z if compiled: 2025-05-07T20:33:15.5411502Z op = torch.compile(op) 2025-05-07T20:33:15.5411792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5412059Z 2025-05-07T20:33:15.5412234Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.5412400Z 2025-05-07T20:33:15.5412495Z moe/activation_test.py:117: 2025-05-07T20:33:15.5412776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5413102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.5413537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5414219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.5414892Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.5415418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.5416091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.5416740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.5417260Z kernel = self.compile( 2025-05-07T20:33:15.5417796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.5418510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.5418909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5419135Z 2025-05-07T20:33:15.5419336Z self = 2025-05-07T20:33:15.5420398Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.5421762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8d66200>} 2025-05-07T20:33:15.5423077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.5424086Z context = 2025-05-07T20:33:15.5424368Z 2025-05-07T20:33:15.5424532Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.5425063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.5425547Z module_map=module_map) 2025-05-07T20:33:15.5425901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.5426243Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.5426489Z E ^ 2025-05-07T20:33:15.5426935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.5427372Z 2025-05-07T20:33:15.5427847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.5428354Z 2025-05-07T20:33:15.5428450Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.5428895Z self=, 2025-05-07T20:33:15.5429297Z T=128, 2025-05-07T20:33:15.5429472Z D=5120, 2025-05-07T20:33:15.5429653Z scale_ub=None, 2025-05-07T20:33:15.5429859Z contiguous=False, 2025-05-07T20:33:15.5430069Z compiled=False, 2025-05-07T20:33:15.5430265Z ) 2025-05-07T20:33:15.5430570Z self = 2025-05-07T20:33:15.5431040Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:15.5431302Z 2025-05-07T20:33:15.5431372Z @given( 2025-05-07T20:33:15.5431591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.5431895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.5432186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.5432515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.5432829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.5433099Z ) 2025-05-07T20:33:15.5433576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.5434010Z def test_silu_mul_quant( 2025-05-07T20:33:15.5434230Z self, 2025-05-07T20:33:15.5434411Z T: int, 2025-05-07T20:33:15.5434594Z D: int, 2025-05-07T20:33:15.5434796Z scale_ub: Optional[float], 2025-05-07T20:33:15.5435055Z contiguous: bool, 2025-05-07T20:33:15.5435284Z compiled: bool, 2025-05-07T20:33:15.5435489Z ) -> None: 2025-05-07T20:33:15.5435690Z torch.manual_seed(2025) 2025-05-07T20:33:15.5435917Z 2025-05-07T20:33:15.5436174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.5436501Z 2025-05-07T20:33:15.5436680Z x_sign = torch.sign(x) 2025-05-07T20:33:15.5437002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.5437296Z x = x_sign * x_clamp 2025-05-07T20:33:15.5437521Z x0 = x[:, :D] 2025-05-07T20:33:15.5437720Z x1 = x[:, D:] 2025-05-07T20:33:15.5437911Z 2025-05-07T20:33:15.5438074Z if contiguous: 2025-05-07T20:33:15.5438291Z x0 = x0.contiguous() 2025-05-07T20:33:15.5438535Z x1 = x1.contiguous() 2025-05-07T20:33:15.5438763Z 2025-05-07T20:33:15.5438943Z if scale_ub is not None: 2025-05-07T20:33:15.5439194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.5439519Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.5439823Z ) 2025-05-07T20:33:15.5440002Z else: 2025-05-07T20:33:15.5440619Z scale_ub_tensor = None 2025-05-07T20:33:15.5440865Z 2025-05-07T20:33:15.5441083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.5441384Z op = silu_mul_quant 2025-05-07T20:33:15.5441620Z if compiled: 2025-05-07T20:33:15.5441854Z op = torch.compile(op) 2025-05-07T20:33:15.5442141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5442401Z 2025-05-07T20:33:15.5442585Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.5442746Z 2025-05-07T20:33:15.5442846Z moe/activation_test.py:117: 2025-05-07T20:33:15.5443136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5443456Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.5443721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5444402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.5445073Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.5445605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.5446286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.5447018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.5447540Z kernel = self.compile( 2025-05-07T20:33:15.5448077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.5454474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.5454874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5455127Z 2025-05-07T20:33:15.5455359Z self = 2025-05-07T20:33:15.5456423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.5457789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8935940>} 2025-05-07T20:33:15.5459304Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.5460304Z context = 2025-05-07T20:33:15.5460583Z 2025-05-07T20:33:15.5460748Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.5461252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.5461706Z module_map=module_map) 2025-05-07T20:33:15.5462058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.5462475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.5462732Z E ^ 2025-05-07T20:33:15.5463192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.5463639Z 2025-05-07T20:33:15.5464075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.5464575Z 2025-05-07T20:33:15.5464675Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.5465073Z self=, 2025-05-07T20:33:15.5465461Z T=128, 2025-05-07T20:33:15.5465632Z D=5120, 2025-05-07T20:33:15.5465811Z scale_ub=1200.0, 2025-05-07T20:33:15.5466022Z contiguous=True, 2025-05-07T20:33:15.5466225Z compiled=False, 2025-05-07T20:33:15.5466421Z ) 2025-05-07T20:33:15.5466723Z self = 2025-05-07T20:33:15.5467201Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:15.5467542Z 2025-05-07T20:33:15.5467618Z @given( 2025-05-07T20:33:15.5467842Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.5468150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.5468444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.5468766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.5469087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.5469358Z ) 2025-05-07T20:33:15.5469695Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.5470121Z def test_silu_mul_quant( 2025-05-07T20:33:15.5470352Z self, 2025-05-07T20:33:15.5470529Z T: int, 2025-05-07T20:33:15.5470712Z D: int, 2025-05-07T20:33:15.5470916Z scale_ub: Optional[float], 2025-05-07T20:33:15.5471167Z contiguous: bool, 2025-05-07T20:33:15.5471393Z compiled: bool, 2025-05-07T20:33:15.5471612Z ) -> None: 2025-05-07T20:33:15.5471809Z torch.manual_seed(2025) 2025-05-07T20:33:15.5472046Z 2025-05-07T20:33:15.5472354Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.5472685Z 2025-05-07T20:33:15.5472868Z x_sign = torch.sign(x) 2025-05-07T20:33:15.5473146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.5473435Z x = x_sign * x_clamp 2025-05-07T20:33:15.5473660Z x0 = x[:, :D] 2025-05-07T20:33:15.5473864Z x1 = x[:, D:] 2025-05-07T20:33:15.5474052Z 2025-05-07T20:33:15.5474221Z if contiguous: 2025-05-07T20:33:15.5474436Z x0 = x0.contiguous() 2025-05-07T20:33:15.5474675Z x1 = x1.contiguous() 2025-05-07T20:33:15.5474904Z 2025-05-07T20:33:15.5475081Z if scale_ub is not None: 2025-05-07T20:33:15.5475343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.5475663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.5475959Z ) 2025-05-07T20:33:15.5476136Z else: 2025-05-07T20:33:15.5476330Z scale_ub_tensor = None 2025-05-07T20:33:15.5476567Z 2025-05-07T20:33:15.5476895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.5477192Z op = silu_mul_quant 2025-05-07T20:33:15.5477429Z if compiled: 2025-05-07T20:33:15.5477665Z op = torch.compile(op) 2025-05-07T20:33:15.5477941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5478195Z 2025-05-07T20:33:15.5478372Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.5478529Z 2025-05-07T20:33:15.5478619Z moe/activation_test.py:117: 2025-05-07T20:33:15.5478898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5479213Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.5479477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5480229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.5480901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.5481425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.5482098Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.5482751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.5483273Z kernel = self.compile( 2025-05-07T20:33:15.5483811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.5484450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.5484826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5485066Z 2025-05-07T20:33:15.5485304Z self = 2025-05-07T20:33:15.5486366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.5487706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872cc20>} 2025-05-07T20:33:15.5489055Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.5490209Z context = 2025-05-07T20:33:15.5490595Z 2025-05-07T20:33:15.5490816Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.5491488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.5492038Z module_map=module_map) 2025-05-07T20:33:15.5492401Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.5492750Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.5492990Z E ^ 2025-05-07T20:33:15.5493442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.5493877Z 2025-05-07T20:33:15.5494293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.7072475Z 2025-05-07T20:33:15.7072860Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.7073380Z self=, 2025-05-07T20:33:15.7073966Z T=1, 2025-05-07T20:33:15.7074230Z D=7168, 2025-05-07T20:33:15.7074479Z scale_ub=1200.0, 2025-05-07T20:33:15.7074799Z contiguous=True, 2025-05-07T20:33:15.7075070Z compiled=True, 2025-05-07T20:33:15.7075327Z ) 2025-05-07T20:33:15.7075819Z self = 2025-05-07T20:33:15.7076373Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:15.7076648Z 2025-05-07T20:33:15.7076724Z @given( 2025-05-07T20:33:15.7076953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.7077259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.7077560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.7077883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.7078204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.7078479Z ) 2025-05-07T20:33:15.7078826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.7079336Z def test_silu_mul_quant( 2025-05-07T20:33:15.7079575Z self, 2025-05-07T20:33:15.7079756Z T: int, 2025-05-07T20:33:15.7079947Z D: int, 2025-05-07T20:33:15.7080166Z scale_ub: Optional[float], 2025-05-07T20:33:15.7080431Z contiguous: bool, 2025-05-07T20:33:15.7080666Z compiled: bool, 2025-05-07T20:33:15.7080889Z ) -> None: 2025-05-07T20:33:15.7081094Z torch.manual_seed(2025) 2025-05-07T20:33:15.7081335Z 2025-05-07T20:33:15.7081602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.7081925Z 2025-05-07T20:33:15.7082105Z x_sign = torch.sign(x) 2025-05-07T20:33:15.7082383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.7082673Z x = x_sign * x_clamp 2025-05-07T20:33:15.7082910Z x0 = x[:, :D] 2025-05-07T20:33:15.7083119Z x1 = x[:, D:] 2025-05-07T20:33:15.7083321Z 2025-05-07T20:33:15.7083501Z if contiguous: 2025-05-07T20:33:15.7083730Z x0 = x0.contiguous() 2025-05-07T20:33:15.7083981Z x1 = x1.contiguous() 2025-05-07T20:33:15.7084206Z 2025-05-07T20:33:15.7084391Z if scale_ub is not None: 2025-05-07T20:33:15.7084658Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.7084974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.7085276Z ) 2025-05-07T20:33:15.7085461Z else: 2025-05-07T20:33:15.7085653Z scale_ub_tensor = None 2025-05-07T20:33:15.7085893Z 2025-05-07T20:33:15.7086110Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.7086405Z op = silu_mul_quant 2025-05-07T20:33:15.7086639Z if compiled: 2025-05-07T20:33:15.7086876Z op = torch.compile(op) 2025-05-07T20:33:15.7087153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7087414Z 2025-05-07T20:33:15.7087602Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.7087762Z 2025-05-07T20:33:15.7087859Z moe/activation_test.py:117: 2025-05-07T20:33:15.7088137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7088540Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.7088812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7089359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.7089902Z return fn(*args, **kwargs) 
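For reference, the eager math in ref_fn above is y = x0 * sigmoid(x0) * x1 followed by row-wise fp8 quantization. A pure-PyTorch sketch of that pipeline, which sidesteps the Triton fp8e4nv lowering entirely; the exact scale_ub handling and epsilon of triton_quantize_fp8_row are assumptions here, not taken from the source:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32, exactly as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Row-wise scale: per-row absmax mapped onto the fp8 e4m3 range.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = row_max.clamp_min(1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Dequantization then matches the test's check:
    # y ~= y_fp8.to(torch.float32) * scale[:, None]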
2025-05-07T20:33:15.7090555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.7091211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.7091741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.7092401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.7093045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.7093558Z kernel = self.compile( 2025-05-07T20:33:15.7094149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.7094834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.7095225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7095443Z 2025-05-07T20:33:15.7095642Z self = 2025-05-07T20:33:15.7096720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.7098085Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872dee0>} 2025-05-07T20:33:15.7099476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.7100473Z context = 2025-05-07T20:33:15.7100755Z 2025-05-07T20:33:15.7100913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.7101419Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.7101873Z module_map=module_map) 2025-05-07T20:33:15.7102222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.7102576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.7102826Z E ^ 2025-05-07T20:33:15.7103269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.7103716Z 2025-05-07T20:33:15.7104143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.7104652Z 2025-05-07T20:33:15.7104748Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.7105168Z self=, 2025-05-07T20:33:15.7105581Z T=1, 2025-05-07T20:33:15.7105754Z D=7168, 2025-05-07T20:33:15.7105940Z scale_ub=1200.0, 2025-05-07T20:33:15.7106162Z contiguous=False, 2025-05-07T20:33:15.7106372Z compiled=True, 2025-05-07T20:33:15.7106557Z ) 2025-05-07T20:33:15.7106863Z self = 2025-05-07T20:33:15.7107335Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:15.7107657Z 2025-05-07T20:33:15.7107727Z @given( 2025-05-07T20:33:15.7107937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.7108231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.7108585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.7108908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.7109222Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.7109499Z ) 2025-05-07T20:33:15.7109831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.7110267Z def test_silu_mul_quant( 2025-05-07T20:33:15.7110497Z self, 2025-05-07T20:33:15.7110682Z T: int, 2025-05-07T20:33:15.7110861Z D: int, 2025-05-07T20:33:15.7111076Z scale_ub: Optional[float], 2025-05-07T20:33:15.7111338Z contiguous: bool, 2025-05-07T20:33:15.7111563Z compiled: bool, 2025-05-07T20:33:15.7111774Z ) -> None: 2025-05-07T20:33:15.7111978Z torch.manual_seed(2025) 2025-05-07T20:33:15.7112210Z 2025-05-07T20:33:15.7112472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.7112798Z 2025-05-07T20:33:15.7112983Z x_sign = torch.sign(x) 2025-05-07T20:33:15.7113321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.7113659Z x = x_sign * x_clamp 2025-05-07T20:33:15.7113886Z x0 = x[:, :D] 2025-05-07T20:33:15.7114095Z x1 = x[:, D:] 2025-05-07T20:33:15.7114288Z 2025-05-07T20:33:15.7114456Z if contiguous: 2025-05-07T20:33:15.7114674Z x0 = x0.contiguous() 2025-05-07T20:33:15.7114919Z x1 = x1.contiguous() 2025-05-07T20:33:15.7115140Z 2025-05-07T20:33:15.7115321Z if scale_ub is not None: 2025-05-07T20:33:15.7115583Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.7115911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.7116200Z ) 2025-05-07T20:33:15.7116387Z else: 2025-05-07T20:33:15.7116634Z scale_ub_tensor = None 2025-05-07T20:33:15.7116870Z 2025-05-07T20:33:15.7117091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.7117391Z op = silu_mul_quant 2025-05-07T20:33:15.7117627Z if compiled: 2025-05-07T20:33:15.7117862Z op = torch.compile(op) 2025-05-07T20:33:15.7118152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7118410Z 2025-05-07T20:33:15.7118588Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.7118746Z 2025-05-07T20:33:15.7118846Z moe/activation_test.py:117: 2025-05-07T20:33:15.7119126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7119458Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.7119729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7120282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.7120826Z return fn(*args, **kwargs) 
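Separate from the compilation failures, the recompile_limit warning earlier in the log shows torch.compile specializing silu_mul_quant once per stride/contiguity combination of x0/x1 until Dynamo gives up and falls back to eager. A short sketch of the standard torch.compile knobs for this, following the warning's own suggestion (the silu_mul_quant import path is inferred from the traceback; the limit value is illustrative):

    import torch
    import torch._dynamo
    # Module path taken from the traceback above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: allow more specializations before falling back to eager.
    torch._dynamo.config.recompile_limit = 16

    # Option 2: compile with dynamic shapes so new T / stride combinations
    # can reuse the same graph instead of triggering a recompile.
    op = torch.compile(silu_mul_quant, dynamic=True)

    # As the warning notes, running the test with TORCH_LOGS="recompiles"
    # prints the guard that failed for each recompilation.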
2025-05-07T20:33:15.7121474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.7122145Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.7122673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.7123362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.7124015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.7124539Z kernel = self.compile( 2025-05-07T20:33:15.7125077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.7125713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.7126105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7126328Z 2025-05-07T20:33:15.7126537Z self = 2025-05-07T20:33:15.7127640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.7129031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872ec00>} 2025-05-07T20:33:15.7130344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.7131391Z context = 2025-05-07T20:33:15.7131667Z 2025-05-07T20:33:15.7131824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.7132331Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.7132870Z module_map=module_map) 2025-05-07T20:33:15.7133221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.7133554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.7133798Z E ^ 2025-05-07T20:33:15.7134245Z E ValueError("type fp8e4nv not supported in this architecture. 
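Note: every failure in this excerpt has the same root cause. Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) lowers to native FP8 only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); the A10G in this linux.g5.4xlarge runner reports (8, 6), so both FBGEMM Triton kernels die at compile time. A minimal sketch of a capability gate that would skip these cases rather than fail them; the helper name supports_fp8e4nv is hypothetical, not part of activation_test.py:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+, e.g. L4/L40S/H100;
    # torch.cuda.get_device_capability() returns (8, 6) for the A10G here.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test above (sketch):
#
#   @unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
#   def test_silu_mul_quant(self, ...): ...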
2025-05-07T20:33:15.9244484Z Trying example: test_silu_mul_quant(
    self=<…>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <…>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

[test body identical to the first example above; elided. This time fn() returned and the failure moved to the reference path:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[make_ir locals elided; options here use num_stages=2]
    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
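For context on the reference path that fails here: ref_fn computes the SiLU product in fp32 and then quantizes row-wise to FP8. A rough eager-mode sketch of that quantization step, assuming the per-row max-abs scheme implied by the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]); this illustrates the contract, it is not FBGEMM's triton_quantize_fp8_row:

from typing import Optional, Tuple

import torch


def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Scale each row so its max-abs value maps to the FP8 e4m3 finite max (448.0).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        # Assumed semantics: scale_ub caps the per-row max before scaling.
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = torch.clamp(y.float() / scale[:, None], -fp8_max, fp8_max).to(
        torch.float8_e4m3fn
    )
    return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]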
[from here on, each remaining example fails with the identical CompilationError; the repeated test-source listings and tracebacks are elided]

2025-05-07T20:33:15.9282505Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:16.0739856Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fn() fails identically in eager mode (moe/activation_test.py:115 -> activation.py:80 directly, no dynamo frame)
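Aside on log volume: the full source and traceback appear to be echoed for every attempted example because the test runs under @settings(verbosity=Verbosity.verbose, ...). At the default verbosity, Hypothesis reports only the final falsifying example, which would shrink this log considerably. A sketch:

from hypothesis import Verbosity, given, settings, strategies as st


@settings(verbosity=Verbosity.normal, deadline=None)  # default: no per-example echo
@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
def test_quiet_example(T: int) -> None:
    assert T in (1, 128, 2048, 4096, 16384)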
2025-05-07T20:33:16.0772684Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error
2025-05-07T20:33:16.0804906Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error
2025-05-07T20:33:16.2756943Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (eager)
2025-05-07T20:33:16.2820373Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (eager)
2025-05-07T20:33:16.4441701Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (contiguous inputs make no difference)
2025-05-07T20:33:16.4493274Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error
2025-05-07T20:33:16.5940676Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (eager)
2025-05-07T20:33:16.5972111Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails through torch.compile (dynamo eval_frame); the traceback is cut off at the end of this excerpt
2025-05-07T20:33:16.5996057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.5996728Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.5997259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.5997917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.5998569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.5999091Z kernel = self.compile( 2025-05-07T20:33:16.5999670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.6000317Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.6000706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.6000926Z 2025-05-07T20:33:16.6001133Z self = 2025-05-07T20:33:16.6002187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.6003542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a76fc0>} 2025-05-07T20:33:16.6004866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.6006006Z context = 2025-05-07T20:33:16.6006289Z 2025-05-07T20:33:16.6006455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.6006957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.6007421Z module_map=module_map) 2025-05-07T20:33:16.6007776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.6008119Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.6008380Z E ^ 2025-05-07T20:33:16.6008839Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.6009325Z 2025-05-07T20:33:16.6009754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.6010273Z 2025-05-07T20:33:16.6010380Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.6010784Z self=, 2025-05-07T20:33:16.6011181Z T=2048, 2025-05-07T20:33:16.6011360Z D=7168, 2025-05-07T20:33:16.6011546Z scale_ub=1200.0, 2025-05-07T20:33:16.6011766Z contiguous=False, 2025-05-07T20:33:16.6011987Z compiled=False, 2025-05-07T20:33:16.7977306Z ) 2025-05-07T20:33:16.7977726Z self = 2025-05-07T20:33:16.7978428Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:16.7978811Z 2025-05-07T20:33:16.7978915Z @given( 2025-05-07T20:33:16.7979221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.7979653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.7980056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.7980476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.7980852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.7981137Z ) 2025-05-07T20:33:16.7981486Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.7981916Z def test_silu_mul_quant( 2025-05-07T20:33:16.7982156Z self, 2025-05-07T20:33:16.7982352Z T: int, 2025-05-07T20:33:16.7982548Z D: int, 2025-05-07T20:33:16.7982765Z scale_ub: Optional[float], 2025-05-07T20:33:16.7983036Z contiguous: bool, 2025-05-07T20:33:16.7983274Z compiled: bool, 2025-05-07T20:33:16.7983488Z ) -> None: 2025-05-07T20:33:16.7983697Z torch.manual_seed(2025) 2025-05-07T20:33:16.7983940Z 2025-05-07T20:33:16.7984204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.7984551Z 2025-05-07T20:33:16.7984750Z x_sign = torch.sign(x) 2025-05-07T20:33:16.7985152Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.7985473Z x = x_sign * x_clamp 2025-05-07T20:33:16.7985709Z x0 = x[:, :D] 2025-05-07T20:33:16.7985913Z x1 = x[:, D:] 2025-05-07T20:33:16.7986118Z 2025-05-07T20:33:16.7986302Z if contiguous: 2025-05-07T20:33:16.7986529Z x0 = x0.contiguous() 2025-05-07T20:33:16.7986785Z x1 = x1.contiguous() 2025-05-07T20:33:16.7987022Z 2025-05-07T20:33:16.7987201Z if scale_ub is not None: 2025-05-07T20:33:16.7987537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.7987877Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.7988185Z ) 2025-05-07T20:33:16.7988369Z else: 2025-05-07T20:33:16.7988581Z scale_ub_tensor = None 2025-05-07T20:33:16.7988828Z 2025-05-07T20:33:16.7989055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.7989367Z op = silu_mul_quant 2025-05-07T20:33:16.7989624Z if compiled: 2025-05-07T20:33:16.7989989Z op = torch.compile(op) 2025-05-07T20:33:16.7990287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.7990562Z 2025-05-07T20:33:16.7990745Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.7990917Z 2025-05-07T20:33:16.7991016Z moe/activation_test.py:117: 2025-05-07T20:33:16.7991306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.7991628Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.7991905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.7992592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:16.7993274Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.7993867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.7994547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.7995210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.7995732Z kernel = self.compile( 2025-05-07T20:33:16.7996262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.7996906Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.7997302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.7997525Z 2025-05-07T20:33:16.7997725Z self = 2025-05-07T20:33:16.7998790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8000148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a77ec0>} 2025-05-07T20:33:16.8001460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8002463Z context = 2025-05-07T20:33:16.8002747Z 2025-05-07T20:33:16.8002908Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8003429Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8003895Z module_map=module_map) 2025-05-07T20:33:16.8004252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8004605Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8004903Z E ^ 2025-05-07T20:33:16.8005374Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8005812Z 2025-05-07T20:33:16.8006227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8006736Z 2025-05-07T20:33:16.8006833Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8007235Z self=, 2025-05-07T20:33:16.8007625Z T=1, 2025-05-07T20:33:16.8007801Z D=7168, 2025-05-07T20:33:16.8007994Z scale_ub=None, 2025-05-07T20:33:16.8008209Z contiguous=True, 2025-05-07T20:33:16.8008425Z compiled=False, 2025-05-07T20:33:16.8008624Z ) 2025-05-07T20:33:16.8008939Z self = 2025-05-07T20:33:16.8009414Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:16.8009683Z 2025-05-07T20:33:16.8009807Z @given( 2025-05-07T20:33:16.8010073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8010379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8010680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8011007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8011332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8011607Z ) 2025-05-07T20:33:16.8011956Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8012403Z def test_silu_mul_quant( 2025-05-07T20:33:16.8012639Z self, 2025-05-07T20:33:16.8012839Z T: int, 2025-05-07T20:33:16.8013034Z D: int, 2025-05-07T20:33:16.8013321Z scale_ub: Optional[float], 2025-05-07T20:33:16.8013588Z contiguous: bool, 2025-05-07T20:33:16.8013828Z compiled: bool, 2025-05-07T20:33:16.8014045Z ) -> None: 2025-05-07T20:33:16.8014262Z torch.manual_seed(2025) 2025-05-07T20:33:16.8014512Z 2025-05-07T20:33:16.8014786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8015127Z 2025-05-07T20:33:16.8015313Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8015611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8015920Z x = x_sign * x_clamp 2025-05-07T20:33:16.8016154Z x0 = x[:, :D] 2025-05-07T20:33:16.8016375Z x1 = x[:, D:] 2025-05-07T20:33:16.8016583Z 2025-05-07T20:33:16.8016761Z if contiguous: 2025-05-07T20:33:16.8016989Z x0 = x0.contiguous() 2025-05-07T20:33:16.8017243Z x1 = x1.contiguous() 2025-05-07T20:33:16.8017480Z 2025-05-07T20:33:16.8017666Z if scale_ub is not None: 2025-05-07T20:33:16.8017941Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8018275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8018581Z ) 2025-05-07T20:33:16.8018780Z else: 2025-05-07T20:33:16.8018994Z scale_ub_tensor = None 2025-05-07T20:33:16.8019237Z 2025-05-07T20:33:16.8019460Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8019768Z op = silu_mul_quant 2025-05-07T20:33:16.8020009Z if compiled: 2025-05-07T20:33:16.8020264Z op = torch.compile(op) 2025-05-07T20:33:16.8020554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8020819Z 2025-05-07T20:33:16.8021006Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8021165Z 2025-05-07T20:33:16.8021267Z moe/activation_test.py:117: 2025-05-07T20:33:16.8021559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8021880Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8022156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8022883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8023563Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8024114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8024797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8025496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8026033Z kernel = self.compile( 2025-05-07T20:33:16.8026573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8027219Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8027655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8027882Z 2025-05-07T20:33:16.8028089Z self = 2025-05-07T20:33:16.8029232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8030583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d4cc0>} 2025-05-07T20:33:16.8031900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8032952Z context = 2025-05-07T20:33:16.8033277Z 2025-05-07T20:33:16.8033443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8033962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8034440Z module_map=module_map) 2025-05-07T20:33:16.8034796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8035163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8035422Z E ^ 2025-05-07T20:33:16.8035879Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8036328Z 2025-05-07T20:33:16.8036746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8037258Z 2025-05-07T20:33:16.8037361Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8037772Z self=, 2025-05-07T20:33:16.8038175Z T=16384, 2025-05-07T20:33:16.8038369Z D=7168, 2025-05-07T20:33:16.8038578Z scale_ub=1200.0, 2025-05-07T20:33:16.8038795Z contiguous=False, 2025-05-07T20:33:16.8039026Z compiled=True, 2025-05-07T20:33:16.8039232Z ) 2025-05-07T20:33:16.8039540Z self = 2025-05-07T20:33:16.8040030Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:16.8040506Z 2025-05-07T20:33:16.8040595Z @given( 2025-05-07T20:33:16.8040821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8041127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8041428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8041756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8042072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8042356Z ) 2025-05-07T20:33:16.8042706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8043206Z def test_silu_mul_quant( 2025-05-07T20:33:16.8043445Z self, 2025-05-07T20:33:16.8043643Z T: int, 2025-05-07T20:33:16.8043832Z D: int, 2025-05-07T20:33:16.8044048Z scale_ub: Optional[float], 2025-05-07T20:33:16.8044316Z contiguous: bool, 2025-05-07T20:33:16.8044546Z compiled: bool, 2025-05-07T20:33:16.8044772Z ) -> None: 2025-05-07T20:33:16.8044982Z torch.manual_seed(2025) 2025-05-07T20:33:16.8045219Z 2025-05-07T20:33:16.8045477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8045811Z 2025-05-07T20:33:16.8046001Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8046280Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8046583Z x = x_sign * x_clamp 2025-05-07T20:33:16.8046818Z x0 = x[:, :D] 2025-05-07T20:33:16.8047032Z x1 = x[:, D:] 2025-05-07T20:33:16.8047243Z 2025-05-07T20:33:16.8047428Z if contiguous: 2025-05-07T20:33:16.8047658Z x0 = x0.contiguous() 2025-05-07T20:33:16.8047913Z x1 = x1.contiguous() 2025-05-07T20:33:16.8048272Z 2025-05-07T20:33:16.8048460Z if scale_ub is not None: 2025-05-07T20:33:16.8048729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8049056Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8049358Z ) 2025-05-07T20:33:16.8049560Z else: 2025-05-07T20:33:16.8049777Z scale_ub_tensor = None 2025-05-07T20:33:16.8050016Z 2025-05-07T20:33:16.8050254Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8050570Z op = silu_mul_quant 2025-05-07T20:33:16.8050827Z if compiled: 2025-05-07T20:33:16.8051070Z op = torch.compile(op) 2025-05-07T20:33:16.8051372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8051715Z 2025-05-07T20:33:16.8051902Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8052074Z 2025-05-07T20:33:16.8052176Z moe/activation_test.py:117: 2025-05-07T20:33:16.8052470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8052789Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8053064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8053613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.8054160Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.8054807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8055479Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8056007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8056673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8057331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8057858Z kernel = self.compile( 2025-05-07T20:33:16.8058411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8059044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8059436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8059658Z 2025-05-07T20:33:16.8059864Z self = 2025-05-07T20:33:16.8060920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8062318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d60c0>} 2025-05-07T20:33:16.8063636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8064638Z context = 2025-05-07T20:33:16.8064918Z 2025-05-07T20:33:16.8065088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8065598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8066066Z module_map=module_map) 2025-05-07T20:33:16.8066421Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8066774Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8067020Z E ^ 2025-05-07T20:33:16.8067520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8068023Z 2025-05-07T20:33:16.8068488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9411536Z 2025-05-07T20:33:16.9411919Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9412380Z self=, 2025-05-07T20:33:16.9412780Z T=1, 2025-05-07T20:33:16.9412966Z D=7168, 2025-05-07T20:33:16.9413155Z scale_ub=None, 2025-05-07T20:33:16.9413362Z contiguous=False, 2025-05-07T20:33:16.9413627Z compiled=False, 2025-05-07T20:33:16.9413826Z ) 2025-05-07T20:33:16.9414142Z self = 2025-05-07T20:33:16.9414626Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:16.9415044Z 2025-05-07T20:33:16.9415120Z @given( 2025-05-07T20:33:16.9415350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9415663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9415960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9416295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9416625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9416910Z ) 2025-05-07T20:33:16.9417246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9417697Z def test_silu_mul_quant( 2025-05-07T20:33:16.9417940Z self, 2025-05-07T20:33:16.9418122Z T: int, 2025-05-07T20:33:16.9418312Z D: int, 2025-05-07T20:33:16.9418525Z scale_ub: Optional[float], 2025-05-07T20:33:16.9418781Z contiguous: bool, 2025-05-07T20:33:16.9419009Z compiled: bool, 2025-05-07T20:33:16.9419234Z ) -> None: 2025-05-07T20:33:16.9419434Z torch.manual_seed(2025) 2025-05-07T20:33:16.9419666Z 2025-05-07T20:33:16.9419935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9420274Z 2025-05-07T20:33:16.9420464Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9420755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9421047Z x = x_sign * x_clamp 2025-05-07T20:33:16.9421290Z x0 = x[:, :D] 2025-05-07T20:33:16.9421515Z x1 = x[:, D:] 2025-05-07T20:33:16.9421728Z 2025-05-07T20:33:16.9421905Z if contiguous: 2025-05-07T20:33:16.9422149Z x0 = x0.contiguous() 2025-05-07T20:33:16.9422411Z x1 = x1.contiguous() 2025-05-07T20:33:16.9422634Z 2025-05-07T20:33:16.9422816Z if scale_ub is not None: 2025-05-07T20:33:16.9423093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9423418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9423728Z ) 2025-05-07T20:33:16.9423917Z else: 2025-05-07T20:33:16.9424192Z scale_ub_tensor = None 2025-05-07T20:33:16.9424440Z 2025-05-07T20:33:16.9424667Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9424961Z op = silu_mul_quant 2025-05-07T20:33:16.9425201Z if compiled: 2025-05-07T20:33:16.9425438Z op = torch.compile(op) 2025-05-07T20:33:16.9425717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9425981Z 2025-05-07T20:33:16.9426164Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9426322Z 2025-05-07T20:33:16.9426418Z moe/activation_test.py:117: 2025-05-07T20:33:16.9426698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9427032Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9427308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9428057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9439889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9440816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9441562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9442237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9442753Z kernel = self.compile( 2025-05-07T20:33:16.9443297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9443943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9444333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9444631Z 2025-05-07T20:33:16.9444833Z self = 2025-05-07T20:33:16.9445944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9447309Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d6c00>} 2025-05-07T20:33:16.9448621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9449616Z context = 2025-05-07T20:33:16.9449900Z 2025-05-07T20:33:16.9450061Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9450579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9451041Z module_map=module_map) 2025-05-07T20:33:16.9451395Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9451737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9451989Z E ^ 2025-05-07T20:33:16.9452435Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9452883Z 2025-05-07T20:33:16.9453298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9453802Z 2025-05-07T20:33:16.9453899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9454302Z self=, 2025-05-07T20:33:16.9454689Z T=2048, 2025-05-07T20:33:16.9454876Z D=7168, 2025-05-07T20:33:16.9455059Z scale_ub=None, 2025-05-07T20:33:16.9455265Z contiguous=False, 2025-05-07T20:33:16.9455485Z compiled=True, 2025-05-07T20:33:16.9455744Z ) 2025-05-07T20:33:16.9456051Z self = 2025-05-07T20:33:16.9456536Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:16.9456807Z 2025-05-07T20:33:16.9456879Z @given( 2025-05-07T20:33:16.9457096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9457390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9457683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9458002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9458312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9458587Z ) 2025-05-07T20:33:16.9458925Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9459357Z def test_silu_mul_quant( 2025-05-07T20:33:16.9459585Z self, 2025-05-07T20:33:16.9459777Z T: int, 2025-05-07T20:33:16.9459963Z D: int, 2025-05-07T20:33:16.9460172Z scale_ub: Optional[float], 2025-05-07T20:33:16.9461095Z contiguous: bool, 2025-05-07T20:33:16.9461327Z compiled: bool, 2025-05-07T20:33:16.9461541Z ) -> None: 2025-05-07T20:33:16.9461757Z torch.manual_seed(2025) 2025-05-07T20:33:16.9461993Z 2025-05-07T20:33:16.9462251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9462590Z 2025-05-07T20:33:16.9462774Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9463050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9463355Z x = x_sign * x_clamp 2025-05-07T20:33:16.9463586Z x0 = x[:, :D] 2025-05-07T20:33:16.9463784Z x1 = x[:, D:] 2025-05-07T20:33:16.9463982Z 2025-05-07T20:33:16.9464157Z if contiguous: 2025-05-07T20:33:16.9464417Z x0 = x0.contiguous() 2025-05-07T20:33:16.9464663Z x1 = x1.contiguous() 2025-05-07T20:33:16.9464897Z 2025-05-07T20:33:16.9465076Z if scale_ub is not None: 2025-05-07T20:33:16.9465346Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9465670Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9465978Z ) 2025-05-07T20:33:16.9466157Z else: 2025-05-07T20:33:16.9466365Z scale_ub_tensor = None 2025-05-07T20:33:16.9466603Z 2025-05-07T20:33:16.9466816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9467117Z op = silu_mul_quant 2025-05-07T20:33:16.9467353Z if compiled: 2025-05-07T20:33:16.9467647Z op = torch.compile(op) 2025-05-07T20:33:16.9467930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9468191Z 2025-05-07T20:33:16.9468368Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9468541Z 2025-05-07T20:33:16.9468634Z moe/activation_test.py:117: 2025-05-07T20:33:16.9468924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9469250Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9469521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9470093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.9470638Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.9471282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9471957Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9472485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9473149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9473798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9474314Z kernel = self.compile( 2025-05-07T20:33:16.9474918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9475609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9475997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9476222Z 2025-05-07T20:33:16.9476419Z self = 2025-05-07T20:33:16.9477519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9478855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d93802c0>} 2025-05-07T20:33:16.9480211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9481245Z context = 2025-05-07T20:33:16.9481524Z 2025-05-07T20:33:16.9481693Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9482209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9482660Z module_map=module_map) 2025-05-07T20:33:16.9483017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9483373Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9483622Z E ^ 2025-05-07T20:33:16.9484075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9484556Z 2025-05-07T20:33:16.9484979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9485490Z 2025-05-07T20:33:16.9485595Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9485996Z self=, 2025-05-07T20:33:16.9486387Z T=4096, 2025-05-07T20:33:16.9486571Z D=7168, 2025-05-07T20:33:16.9486751Z scale_ub=None, 2025-05-07T20:33:16.9486963Z contiguous=False, 2025-05-07T20:33:16.9487184Z compiled=True, 2025-05-07T20:33:17.3597426Z ) 2025-05-07T20:33:17.3598127Z self = 2025-05-07T20:33:17.3598831Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:17.3599197Z 2025-05-07T20:33:17.3599306Z @given( 2025-05-07T20:33:17.3599657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3600111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3600439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3600767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3601093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3601373Z ) 2025-05-07T20:33:17.3601708Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3602144Z def test_silu_mul_quant( 2025-05-07T20:33:17.3602391Z self, 2025-05-07T20:33:17.3602578Z T: int, 2025-05-07T20:33:17.3602772Z D: int, 2025-05-07T20:33:17.3602992Z scale_ub: Optional[float], 2025-05-07T20:33:17.3603251Z contiguous: bool, 2025-05-07T20:33:17.3603493Z compiled: bool, 2025-05-07T20:33:17.3603716Z ) -> None: 2025-05-07T20:33:17.3603918Z torch.manual_seed(2025) 2025-05-07T20:33:17.3604160Z 2025-05-07T20:33:17.3604430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3604772Z 2025-05-07T20:33:17.3605096Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3605392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3605696Z x = x_sign * x_clamp 2025-05-07T20:33:17.3605927Z x0 = x[:, :D] 2025-05-07T20:33:17.3606138Z x1 = x[:, D:] 2025-05-07T20:33:17.3606342Z 2025-05-07T20:33:17.3606514Z if contiguous: 2025-05-07T20:33:17.3606737Z x0 = x0.contiguous() 2025-05-07T20:33:17.3606989Z x1 = x1.contiguous() 2025-05-07T20:33:17.3607216Z 2025-05-07T20:33:17.3607401Z if scale_ub is not None: 2025-05-07T20:33:17.3607672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3607995Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3608297Z ) 2025-05-07T20:33:17.3608489Z else: 2025-05-07T20:33:17.3608696Z scale_ub_tensor = None 2025-05-07T20:33:17.3608951Z 2025-05-07T20:33:17.3609183Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3609483Z op = silu_mul_quant 2025-05-07T20:33:17.3609910Z if compiled: 2025-05-07T20:33:17.3610147Z op = torch.compile(op) 2025-05-07T20:33:17.3610440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3610711Z 2025-05-07T20:33:17.3610888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3611052Z 2025-05-07T20:33:17.3611147Z moe/activation_test.py:117: 2025-05-07T20:33:17.3611433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3611750Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3612013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3612560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.3613171Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.3613817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.3614489Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3615037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3615699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3616346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3616871Z kernel = self.compile( 2025-05-07T20:33:17.3617402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3618033Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3618415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3618637Z 2025-05-07T20:33:17.3618842Z self = 2025-05-07T20:33:17.3619907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3621269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9380d60>} 2025-05-07T20:33:17.3622580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3623584Z context = 2025-05-07T20:33:17.3623865Z 2025-05-07T20:33:17.3624031Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3624592Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3625059Z module_map=module_map) 2025-05-07T20:33:17.3625424Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3625761Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3626014Z E ^ 2025-05-07T20:33:17.3626467Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3626902Z 2025-05-07T20:33:17.3627320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3627939Z 2025-05-07T20:33:17.3628035Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3628444Z self=, 2025-05-07T20:33:17.3628834Z T=16384, 2025-05-07T20:33:17.3629020Z D=5120, 2025-05-07T20:33:17.3629204Z scale_ub=1200.0, 2025-05-07T20:33:17.3629423Z contiguous=False, 2025-05-07T20:33:17.3629644Z compiled=False, 2025-05-07T20:33:17.3629886Z ) 2025-05-07T20:33:17.3630236Z self = 2025-05-07T20:33:17.3630727Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:17.3630997Z 2025-05-07T20:33:17.3631070Z @given( 2025-05-07T20:33:17.3631293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3631599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3631893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3632216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3632531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3632804Z ) 2025-05-07T20:33:17.3633134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3633626Z def test_silu_mul_quant( 2025-05-07T20:33:17.3633859Z self, 2025-05-07T20:33:17.3634040Z T: int, 2025-05-07T20:33:17.3634231Z D: int, 2025-05-07T20:33:17.3634441Z scale_ub: Optional[float], 2025-05-07T20:33:17.3634701Z contiguous: bool, 2025-05-07T20:33:17.3634928Z compiled: bool, 2025-05-07T20:33:17.3635144Z ) -> None: 2025-05-07T20:33:17.3635342Z torch.manual_seed(2025) 2025-05-07T20:33:17.3635582Z 2025-05-07T20:33:17.3635846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3636178Z 2025-05-07T20:33:17.3636369Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3636651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3636945Z x = x_sign * x_clamp 2025-05-07T20:33:17.3637178Z x0 = x[:, :D] 2025-05-07T20:33:17.3637387Z x1 = x[:, D:] 2025-05-07T20:33:17.3637589Z 2025-05-07T20:33:17.3637763Z if contiguous: 2025-05-07T20:33:17.3637986Z x0 = x0.contiguous() 2025-05-07T20:33:17.3638246Z x1 = x1.contiguous() 2025-05-07T20:33:17.3638473Z 2025-05-07T20:33:17.3638669Z if scale_ub is not None: 2025-05-07T20:33:17.3638940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3639269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3639576Z ) 2025-05-07T20:33:17.3639765Z else: 2025-05-07T20:33:17.3639970Z scale_ub_tensor = None 2025-05-07T20:33:17.3640680Z 2025-05-07T20:33:17.3640926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3641269Z op = silu_mul_quant 2025-05-07T20:33:17.3641536Z if compiled: 2025-05-07T20:33:17.3641805Z op = torch.compile(op) 2025-05-07T20:33:17.3642121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3642383Z 2025-05-07T20:33:17.3642573Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3642731Z 2025-05-07T20:33:17.3642833Z moe/activation_test.py:117: 2025-05-07T20:33:17.3643201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3643528Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3643804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3644478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:17.3645156Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3645740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3646408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3647056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3647571Z kernel = self.compile( 2025-05-07T20:33:17.3648104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3648800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3649238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3649461Z 2025-05-07T20:33:17.3649661Z self = 2025-05-07T20:33:17.3650720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3652058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9381c60>} 2025-05-07T20:33:17.3653434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3654436Z context = 2025-05-07T20:33:17.3654717Z 2025-05-07T20:33:17.3654883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3655425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3655908Z module_map=module_map) 2025-05-07T20:33:17.3656266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3656609Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3656853Z E ^ 2025-05-07T20:33:17.3657302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3657743Z 2025-05-07T20:33:17.3658165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3658663Z 2025-05-07T20:33:17.3658771Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3659169Z self=, 2025-05-07T20:33:17.3659565Z T=16384, 2025-05-07T20:33:17.3659752Z D=5120, 2025-05-07T20:33:17.3659928Z scale_ub=1200.0, 2025-05-07T20:33:17.3660148Z contiguous=True, 2025-05-07T20:33:17.3660369Z compiled=True, 2025-05-07T20:33:17.3660562Z ) 2025-05-07T20:33:17.3660876Z self = 2025-05-07T20:33:17.3661357Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:17.3661621Z 2025-05-07T20:33:17.3661703Z @given( 2025-05-07T20:33:17.3661918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3662221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3662529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3662906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3663226Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3663506Z ) 2025-05-07T20:33:17.3663837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3664267Z def test_silu_mul_quant( 2025-05-07T20:33:17.3664498Z self, 2025-05-07T20:33:17.3664674Z T: int, 2025-05-07T20:33:17.3664871Z D: int, 2025-05-07T20:33:17.3665082Z scale_ub: Optional[float], 2025-05-07T20:33:17.3665334Z contiguous: bool, 2025-05-07T20:33:17.3665570Z compiled: bool, 2025-05-07T20:33:17.3665782Z ) -> None: 2025-05-07T20:33:17.3665985Z torch.manual_seed(2025) 2025-05-07T20:33:17.3666211Z 2025-05-07T20:33:17.3666467Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3666800Z 2025-05-07T20:33:17.3666975Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3667253Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3667626Z x = x_sign * x_clamp 2025-05-07T20:33:17.3667969Z x0 = x[:, :D] 2025-05-07T20:33:17.3668175Z x1 = x[:, D:] 2025-05-07T20:33:17.3668373Z 2025-05-07T20:33:17.3668547Z if contiguous: 2025-05-07T20:33:17.3668772Z x0 = x0.contiguous() 2025-05-07T20:33:17.3669018Z x1 = x1.contiguous() 2025-05-07T20:33:17.3669244Z 2025-05-07T20:33:17.3669425Z if scale_ub is not None: 2025-05-07T20:33:17.3669691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3670010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3670308Z ) 2025-05-07T20:33:17.3670494Z else: 2025-05-07T20:33:17.3670699Z scale_ub_tensor = None 2025-05-07T20:33:17.3670935Z 2025-05-07T20:33:17.3671208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3671516Z op = silu_mul_quant 2025-05-07T20:33:17.3671749Z if compiled: 2025-05-07T20:33:17.3671991Z op = torch.compile(op) 2025-05-07T20:33:17.3672279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3672536Z 2025-05-07T20:33:17.3672715Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3672871Z 2025-05-07T20:33:17.3672968Z moe/activation_test.py:117: 2025-05-07T20:33:17.3673241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3673555Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3673820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3674364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.3674898Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.3675543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.3676212Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3676740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3677400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3678048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3678563Z kernel = self.compile( 2025-05-07T20:33:17.3679093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3679737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3680130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3680349Z 2025-05-07T20:33:17.3680548Z self = 2025-05-07T20:33:17.3681664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3683067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9383380>} 2025-05-07T20:33:17.3684374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3685391Z context = 2025-05-07T20:33:17.3685702Z 2025-05-07T20:33:17.3685861Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3686371Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3686827Z module_map=module_map) 2025-05-07T20:33:17.3687181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3687604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3687851Z E ^ 2025-05-07T20:33:17.3688298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3688735Z 2025-05-07T20:33:17.3689149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.5258916Z 2025-05-07T20:33:17.5259109Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.5259683Z self=, 2025-05-07T20:33:17.5260340Z T=16384, 2025-05-07T20:33:17.5260593Z D=5120, 2025-05-07T20:33:17.5268133Z scale_ub=None, 2025-05-07T20:33:17.5268431Z contiguous=False, 2025-05-07T20:33:17.5268738Z compiled=True, 2025-05-07T20:33:17.5268935Z ) 2025-05-07T20:33:17.5269252Z self = 2025-05-07T20:33:17.5269760Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:17.5270033Z 2025-05-07T20:33:17.5270118Z @given( 2025-05-07T20:33:17.5270344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.5270657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.5270959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.5271279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.5271614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.5271896Z ) 2025-05-07T20:33:17.5272238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.5272694Z def test_silu_mul_quant( 2025-05-07T20:33:17.5272941Z self, 2025-05-07T20:33:17.5273131Z T: int, 2025-05-07T20:33:17.5273327Z D: int, 2025-05-07T20:33:17.5273549Z scale_ub: Optional[float], 2025-05-07T20:33:17.5273813Z contiguous: bool, 2025-05-07T20:33:17.5274057Z compiled: bool, 2025-05-07T20:33:17.5274284Z ) -> None: 2025-05-07T20:33:17.5274499Z torch.manual_seed(2025) 2025-05-07T20:33:17.5274732Z 2025-05-07T20:33:17.5275002Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.5275349Z 2025-05-07T20:33:17.5275537Z x_sign = torch.sign(x) 2025-05-07T20:33:17.5275829Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.5276134Z x = x_sign * x_clamp 2025-05-07T20:33:17.5276373Z x0 = x[:, :D] 2025-05-07T20:33:17.5276594Z x1 = x[:, D:] 2025-05-07T20:33:17.5276800Z 2025-05-07T20:33:17.5276979Z if contiguous: 2025-05-07T20:33:17.5277210Z x0 = x0.contiguous() 2025-05-07T20:33:17.5277463Z x1 = x1.contiguous() 2025-05-07T20:33:17.5277696Z 2025-05-07T20:33:17.5277886Z if scale_ub is not None: 2025-05-07T20:33:17.5278273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.5278604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.5278911Z ) 2025-05-07T20:33:17.5279100Z else: 2025-05-07T20:33:17.5279315Z scale_ub_tensor = None 2025-05-07T20:33:17.5279563Z 2025-05-07T20:33:17.5279794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.5280106Z op = silu_mul_quant 2025-05-07T20:33:17.5280346Z if compiled: 2025-05-07T20:33:17.5280587Z op = torch.compile(op) 2025-05-07T20:33:17.5280879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5281143Z 2025-05-07T20:33:17.5281341Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.5281507Z 2025-05-07T20:33:17.5281607Z moe/activation_test.py:117: 2025-05-07T20:33:17.5281896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5282227Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.5282508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5283194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.5283746Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.5284399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.5285072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.5285596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.5286270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.5286933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.5287506Z kernel = self.compile( 2025-05-07T20:33:17.5288054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.5288710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.5289100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5289324Z 2025-05-07T20:33:17.5289531Z self = 2025-05-07T20:33:17.5290596Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.5291953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d83485e0>} 2025-05-07T20:33:17.5293276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.5294293Z context = 2025-05-07T20:33:17.5294579Z 2025-05-07T20:33:17.5294742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.5295267Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.5295729Z module_map=module_map) 2025-05-07T20:33:17.5296094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.5296441Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.5296701Z E ^ 2025-05-07T20:33:17.5297165Z E ValueError("type fp8e4nv not supported in this architecture. 
Ten further examples fail identically, each repeating the same test-body listing and the same Triton trace as above and ending in

    E   triton.compiler.errors.CompilationError: at 1:0:
    E   def _fbgemm_silu_mul_quant(
    E   ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

    /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Only the sampled parameters differ (for compiled=False the torch/_dynamo/eval_frame.py frame is absent from the trace; everything else is identical):

2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
2025-05-07T20:33:18 Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:18 Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:18 Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
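Because the ValueError is raised while src.make_ir parses the kernel into Triton IR, before any launch or any data is touched, the outcome is independent of T, D, scale_ub, contiguous, and compiled. A standalone repro sketch under that assumption, using the call signature shown in the test and the import path from the trace; the shapes are the smallest sampled values:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([1, 2 * D], device="cuda", dtype=torch.bfloat16)
    # On this runner (SM 8.6) the Triton JIT raises
    # triton.compiler.errors.CompilationError wrapping the fp8e4nv ValueError.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)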
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.3030646Z 2025-05-07T20:33:18.3031059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.3031559Z 2025-05-07T20:33:18.3031667Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.3032144Z self=, 2025-05-07T20:33:18.3032528Z T=16384, 2025-05-07T20:33:18.3032715Z D=5120, 2025-05-07T20:33:18.3032905Z scale_ub=None, 2025-05-07T20:33:18.3033107Z contiguous=False, 2025-05-07T20:33:18.3033331Z compiled=False, 2025-05-07T20:33:18.3033531Z ) 2025-05-07T20:33:18.3033842Z self = 2025-05-07T20:33:18.3034328Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.3034598Z 2025-05-07T20:33:18.3034678Z @given( 2025-05-07T20:33:18.3034898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.3035271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.3035566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.3035886Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.3036212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.3036485Z ) 2025-05-07T20:33:18.3036822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.3037259Z def test_silu_mul_quant( 2025-05-07T20:33:18.3037489Z self, 2025-05-07T20:33:18.3037675Z T: int, 2025-05-07T20:33:18.3037860Z D: int, 2025-05-07T20:33:18.3038080Z scale_ub: Optional[float], 2025-05-07T20:33:18.3038342Z contiguous: bool, 2025-05-07T20:33:18.3038569Z compiled: bool, 2025-05-07T20:33:18.3038780Z ) -> None: 2025-05-07T20:33:18.3038995Z torch.manual_seed(2025) 2025-05-07T20:33:18.3039225Z 2025-05-07T20:33:18.3039477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.3039804Z 2025-05-07T20:33:18.3039989Z x_sign = torch.sign(x) 2025-05-07T20:33:18.3040488Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.3042479Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
2025-05-07T20:33:18.3044777Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:18.3055403Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free of 22.07 GiB; 21.61 GiB allocated by PyTorch)
2025-05-07T20:33:18.3057480Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:18.3057792Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:18.3067662Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free of 22.07 GiB; 21.50 GiB allocated by PyTorch)
2025-05-07T20:33:18.3069605Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:18.4319437Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:18.4330233Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; 21.67 GiB allocated by PyTorch)
2025-05-07T20:33:18.4332292Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:18.4332593Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.4343159Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; 21.67 GiB allocated by PyTorch)
2025-05-07T20:33:18.4345183Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:18.4345394Z
2025-05-07T20:33:18.4345498Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:18.4345894Z     self=,
2025-05-07T20:33:18.4346304Z     T=1,
2025-05-07T20:33:18.4346475Z     D=7168,
2025-05-07T20:33:18.4346652Z     scale_ub=1200.0,
2025-05-07T20:33:18.4346862Z     contiguous=True,
2025-05-07T20:33:18.4347071Z     compiled=False,
2025-05-07T20:33:18.4347258Z )
2025-05-07T20:33:18.4347617Z self =
2025-05-07T20:33:18.4348092Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:18.4348345Z
2025-05-07T20:33:18.4348420Z     @given(
2025-05-07T20:33:18.4348634Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:18.4348928Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:18.4349220Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:18.4349534Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:18.4349851Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:18.4350117Z     )
2025-05-07T20:33:18.4350450Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:18.4350893Z     def test_silu_mul_quant(
2025-05-07T20:33:18.4351138Z         self,
2025-05-07T20:33:18.4351325Z         T: int,
2025-05-07T20:33:18.4351521Z         D: int,
2025-05-07T20:33:18.4351743Z         scale_ub: Optional[float],
2025-05-07T20:33:18.4352004Z         contiguous: bool,
2025-05-07T20:33:18.4352247Z         compiled: bool,
2025-05-07T20:33:18.4352462Z     ) -> None:
2025-05-07T20:33:18.4352739Z         torch.manual_seed(2025)
2025-05-07T20:33:18.4352974Z
2025-05-07T20:33:18.4353237Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:18.4353558Z
2025-05-07T20:33:18.4353733Z         x_sign = torch.sign(x)
2025-05-07T20:33:18.4354010Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:18.4354302Z         x = x_sign * x_clamp
2025-05-07T20:33:18.4354522Z         x0 = x[:, :D]
2025-05-07T20:33:18.4354729Z         x1 = x[:, D:]
2025-05-07T20:33:18.4354923Z
2025-05-07T20:33:18.4355093Z         if contiguous:
2025-05-07T20:33:18.4355317Z             x0 = x0.contiguous()
2025-05-07T20:33:18.4355558Z             x1 = x1.contiguous()
2025-05-07T20:33:18.4355816Z
2025-05-07T20:33:18.4356016Z         if scale_ub is not None:
2025-05-07T20:33:18.4356275Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:18.4356603Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:18.4356900Z             )
2025-05-07T20:33:18.4357080Z         else:
2025-05-07T20:33:18.4357277Z             scale_ub_tensor = None
2025-05-07T20:33:18.4357637Z
2025-05-07T20:33:18.4357858Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:18.4358166Z             op = silu_mul_quant
2025-05-07T20:33:18.4365589Z             if compiled:
2025-05-07T20:33:18.4365850Z                 op = torch.compile(op)
2025-05-07T20:33:18.4366142Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:18.4366407Z
2025-05-07T20:33:18.4366593Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:18.4366753Z
2025-05-07T20:33:18.4366847Z moe/activation_test.py:117:
2025-05-07T20:33:18.4367135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:18.4367459Z moe/activation_test.py:115: in fn
2025-05-07T20:33:18.4367805Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:18.4368504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:18.4369185Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:18.4380163Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:18.4380507Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:18.4380751Z E       ^
2025-05-07T20:33:18.4381196Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.4381632Z
2025-05-07T20:33:18.4382049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:18.4382640Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.4409888Z E       triton.compiler.errors.CompilationError: at 1:0 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.4411789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
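Hypothesis prints each "Trying example" block because the test runs with verbosity=Verbosity.verbose; once a failing parameter set is known from a log like this one, it can be pinned with an @example decorator so reruns exercise it deterministically before the sampled grid. A minimal sketch of that pattern (a standalone toy test, not the FBGEMM test itself):

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=1, D=7168, scale_ub=1200.00)  # failing case from the log, pinned
    @settings(deadline=None, max_examples=16)
    def test_parameter_grid(T: int, D: int, scale_ub: Optional[float]) -> None:
        # Stand-in assertion; the real test would call the kernel under test.
        assert T >= 1 and D in (5120, 7168)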
2025-05-07T20:33:18.5545709Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.5573424Z E       triton.compiler.errors.CompilationError: at 1:0 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.5575486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:18.5576150Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:18.5586007Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free of 22.07 GiB; 21.69 GiB allocated by PyTorch)
2025-05-07T20:33:18.5588042Z moe/activation_test.py:92: OutOfMemoryError
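The "Tried to allocate" sizes line up exactly with the test's input tensor: a [T, 2*D] bfloat16 tensor occupies T * 2D * 2 bytes. A quick check of that arithmetic against the sizes in this log:

    def input_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element.
        return T * 2 * D * 2 / 2**20

    assert input_mib(16384, 7168) == 448.0  # the 448.00 MiB requests
    assert input_mib(16384, 5120) == 320.0  # the 320.00 MiB requests
    assert input_mib(4096, 7168) == 112.0
    assert input_mib(2048, 5120) == 40.0    # fails once only ~26 MiB is free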
2025-05-07T20:33:18.5588360Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:18.5615867Z E       triton.compiler.errors.CompilationError: at 1:0 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.5617856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:18.6450090Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.6460573Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; 21.73 GiB allocated by PyTorch)
2025-05-07T20:33:18.6462610Z moe/activation_test.py:94: OutOfMemoryError
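For context on what the kernel under test computes: silu_mul_quant is FBGEMM's fused MoE activation, and the test shapes suggest the usual SwiGLU-style pattern of silu(x0) * x1 followed by FP8 quantization with an optional scale upper bound. A sketch of an eager-mode reference under that assumption (the rowwise-scale details here are illustrative, not FBGEMM's exact algorithm):

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: fused silu(x0) * x1, then rowwise fp8 scaling.
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)  # cap the rowwise amax
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = amax / fp8_max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)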
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6462495Z 2025-05-07T20:33:18.6462610Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:18.6462826Z 2025-05-07T20:33:18.6462926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6463336Z self=, 2025-05-07T20:33:18.6463735Z T=16384, 2025-05-07T20:33:18.6463987Z D=5120, 2025-05-07T20:33:18.6464218Z scale_ub=None, 2025-05-07T20:33:18.6464425Z contiguous=True, 2025-05-07T20:33:18.6464635Z compiled=False, 2025-05-07T20:33:18.6464829Z ) 2025-05-07T20:33:18.6465137Z self = 2025-05-07T20:33:18.6465613Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.6465923Z 2025-05-07T20:33:18.6466011Z @given( 2025-05-07T20:33:18.6466246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6466548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6466846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6467169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6467639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6467911Z ) 2025-05-07T20:33:18.6468259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6468707Z def test_silu_mul_quant( 2025-05-07T20:33:18.6468935Z self, 2025-05-07T20:33:18.6469124Z T: int, 2025-05-07T20:33:18.6469318Z D: int, 2025-05-07T20:33:18.6469523Z scale_ub: Optional[float], 2025-05-07T20:33:18.6469791Z contiguous: bool, 2025-05-07T20:33:18.6470030Z compiled: bool, 2025-05-07T20:33:18.6470244Z ) -> None: 2025-05-07T20:33:18.6470458Z torch.manual_seed(2025) 2025-05-07T20:33:18.6477638Z 2025-05-07T20:33:18.6477938Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6479994Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6481868Z 2025-05-07T20:33:18.6481987Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6482204Z 2025-05-07T20:33:18.6482304Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6482706Z self=, 2025-05-07T20:33:18.6483107Z T=4096, 2025-05-07T20:33:18.6483293Z D=5120, 2025-05-07T20:33:18.6483487Z scale_ub=None, 2025-05-07T20:33:18.6483694Z contiguous=True, 2025-05-07T20:33:18.6483917Z compiled=False, 2025-05-07T20:33:18.6484123Z ) 2025-05-07T20:33:18.6484437Z self = 2025-05-07T20:33:18.6484917Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.6485250Z 2025-05-07T20:33:18.6485338Z @given( 2025-05-07T20:33:18.6485565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6485867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6486165Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6486484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6486801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6487080Z ) 2025-05-07T20:33:18.6487421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6487860Z def test_silu_mul_quant( 2025-05-07T20:33:18.6488101Z self, 2025-05-07T20:33:18.6488289Z T: int, 2025-05-07T20:33:18.6488476Z D: int, 2025-05-07T20:33:18.6488691Z scale_ub: Optional[float], 2025-05-07T20:33:18.6488965Z contiguous: bool, 2025-05-07T20:33:18.6489193Z compiled: bool, 2025-05-07T20:33:18.6489412Z ) -> None: 2025-05-07T20:33:18.6489622Z torch.manual_seed(2025) 2025-05-07T20:33:18.6489904Z 2025-05-07T20:33:18.6490203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6492201Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6494100Z 2025-05-07T20:33:18.6494253Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6494456Z 2025-05-07T20:33:18.6494556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6494958Z self=, 2025-05-07T20:33:18.6495362Z T=2048, 2025-05-07T20:33:18.6495548Z D=5120, 2025-05-07T20:33:18.6495728Z scale_ub=None, 2025-05-07T20:33:18.6495957Z contiguous=False, 2025-05-07T20:33:18.6496198Z compiled=False, 2025-05-07T20:33:18.6496393Z ) 2025-05-07T20:33:18.6496693Z self = 2025-05-07T20:33:18.6497167Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.6497431Z 2025-05-07T20:33:18.6497512Z @given( 2025-05-07T20:33:18.6497727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6498029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6498324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6498643Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6498966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6499246Z ) 2025-05-07T20:33:18.6499598Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6500038Z def test_silu_mul_quant( 2025-05-07T20:33:18.6500273Z self, 2025-05-07T20:33:18.6500453Z T: int, 2025-05-07T20:33:18.6500642Z D: int, 2025-05-07T20:33:18.6500851Z scale_ub: Optional[float], 2025-05-07T20:33:18.6501110Z contiguous: bool, 2025-05-07T20:33:18.6501340Z compiled: bool, 2025-05-07T20:33:18.6501553Z ) -> None: 2025-05-07T20:33:18.6501764Z torch.manual_seed(2025) 2025-05-07T20:33:18.6501996Z 2025-05-07T20:33:18.6502251Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6504312Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6506186Z 2025-05-07T20:33:18.6506301Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6506505Z 2025-05-07T20:33:18.6506606Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6507001Z self=, 2025-05-07T20:33:18.6507390Z T=4096, 2025-05-07T20:33:18.6507629Z D=7168, 2025-05-07T20:33:18.6507808Z scale_ub=None, 2025-05-07T20:33:18.6508014Z contiguous=True, 2025-05-07T20:33:18.6508231Z compiled=True, 2025-05-07T20:33:18.6508422Z ) 2025-05-07T20:33:18.6508731Z self = 2025-05-07T20:33:18.6509207Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:18.6509552Z 2025-05-07T20:33:18.6509633Z @given( 2025-05-07T20:33:18.6509850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6510154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6510451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6510765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6511082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6511364Z ) 2025-05-07T20:33:18.6511708Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6512141Z def test_silu_mul_quant( 2025-05-07T20:33:18.6512379Z self, 2025-05-07T20:33:18.6512562Z T: int, 2025-05-07T20:33:18.6512797Z D: int, 2025-05-07T20:33:18.6513009Z scale_ub: Optional[float], 2025-05-07T20:33:18.6513266Z contiguous: bool, 2025-05-07T20:33:18.6513504Z compiled: bool, 2025-05-07T20:33:18.6513724Z ) -> None: 2025-05-07T20:33:18.6513934Z torch.manual_seed(2025) 2025-05-07T20:33:18.6514170Z 2025-05-07T20:33:18.6514431Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6516446Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6518274Z 2025-05-07T20:33:18.6518394Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6518600Z 2025-05-07T20:33:18.6518703Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6519110Z self=, 2025-05-07T20:33:18.6519511Z T=2048, 2025-05-07T20:33:18.6519693Z D=5120, 2025-05-07T20:33:18.6519871Z scale_ub=1200.0, 2025-05-07T20:33:18.6520089Z contiguous=False, 2025-05-07T20:33:18.6520309Z compiled=False, 2025-05-07T20:33:18.7066230Z ) 2025-05-07T20:33:18.7066903Z self = 2025-05-07T20:33:18.7067719Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:18.7068093Z 2025-05-07T20:33:18.7068200Z @given( 2025-05-07T20:33:18.7068504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7068912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7069303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7069653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7070093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7070375Z ) 2025-05-07T20:33:18.7070722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7071178Z def test_silu_mul_quant( 2025-05-07T20:33:18.7071422Z self, 2025-05-07T20:33:18.7071605Z T: int, 2025-05-07T20:33:18.7071795Z D: int, 2025-05-07T20:33:18.7072013Z scale_ub: Optional[float], 2025-05-07T20:33:18.7072306Z contiguous: bool, 2025-05-07T20:33:18.7072537Z compiled: bool, 2025-05-07T20:33:18.7072747Z ) -> None: 2025-05-07T20:33:18.7072948Z torch.manual_seed(2025) 2025-05-07T20:33:18.7073182Z 2025-05-07T20:33:18.7073444Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7075527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7077458Z 2025-05-07T20:33:18.7077574Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7077781Z 2025-05-07T20:33:18.7077887Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7078294Z self=, 2025-05-07T20:33:18.7078678Z T=4096, 2025-05-07T20:33:18.7078856Z D=7168, 2025-05-07T20:33:18.7079037Z scale_ub=1200.0, 2025-05-07T20:33:18.7079307Z contiguous=True, 2025-05-07T20:33:18.7079516Z compiled=False, 2025-05-07T20:33:18.7079713Z ) 2025-05-07T20:33:18.7080023Z self = 2025-05-07T20:33:18.7080512Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.7080777Z 2025-05-07T20:33:18.7080853Z @given( 2025-05-07T20:33:18.7081065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7081361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7081662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7081979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7082288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7082559Z ) 2025-05-07T20:33:18.7082893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7083333Z def test_silu_mul_quant( 2025-05-07T20:33:18.7083570Z self, 2025-05-07T20:33:18.7083751Z T: int, 2025-05-07T20:33:18.7083933Z D: int, 2025-05-07T20:33:18.7084140Z scale_ub: Optional[float], 2025-05-07T20:33:18.7084401Z contiguous: bool, 2025-05-07T20:33:18.7084629Z compiled: bool, 2025-05-07T20:33:18.7084838Z ) -> None: 2025-05-07T20:33:18.7085042Z torch.manual_seed(2025) 2025-05-07T20:33:18.7085259Z 2025-05-07T20:33:18.7085513Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7087523Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7089385Z 2025-05-07T20:33:18.7089547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7089756Z 2025-05-07T20:33:18.7089859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7090250Z self=, 2025-05-07T20:33:18.7090647Z T=16384, 2025-05-07T20:33:18.7090830Z D=7168, 2025-05-07T20:33:18.7091004Z scale_ub=None, 2025-05-07T20:33:18.7091212Z contiguous=False, 2025-05-07T20:33:18.7091425Z compiled=True, 2025-05-07T20:33:18.7091615Z ) 2025-05-07T20:33:18.7091918Z self = 2025-05-07T20:33:18.7092394Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:18.7092672Z 2025-05-07T20:33:18.7092746Z @given( 2025-05-07T20:33:18.7092961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7093273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7093562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7093873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7094265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7094531Z ) 2025-05-07T20:33:18.7094861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7095297Z def test_silu_mul_quant( 2025-05-07T20:33:18.7095524Z self, 2025-05-07T20:33:18.7095700Z T: int, 2025-05-07T20:33:18.7095882Z D: int, 2025-05-07T20:33:18.7096092Z scale_ub: Optional[float], 2025-05-07T20:33:18.7096355Z contiguous: bool, 2025-05-07T20:33:18.7096574Z compiled: bool, 2025-05-07T20:33:18.7096779Z ) -> None: 2025-05-07T20:33:18.7096982Z torch.manual_seed(2025) 2025-05-07T20:33:18.7097212Z 2025-05-07T20:33:18.7097518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7099531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7101383Z 2025-05-07T20:33:18.7101503Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7101710Z 2025-05-07T20:33:18.7101807Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7102208Z self=, 2025-05-07T20:33:18.7102608Z T=4096, 2025-05-07T20:33:18.7102795Z D=7168, 2025-05-07T20:33:18.7102974Z scale_ub=None, 2025-05-07T20:33:18.7103178Z contiguous=True, 2025-05-07T20:33:18.7103390Z compiled=False, 2025-05-07T20:33:18.7103584Z ) 2025-05-07T20:33:18.7103893Z self = 2025-05-07T20:33:18.7104372Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.7104631Z 2025-05-07T20:33:18.7104701Z @given( 2025-05-07T20:33:18.7104919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7105224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7105512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7105829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7106146Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7106421Z ) 2025-05-07T20:33:18.7106755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7107201Z def test_silu_mul_quant( 2025-05-07T20:33:18.7107512Z self, 2025-05-07T20:33:18.7107693Z T: int, 2025-05-07T20:33:18.7107927Z D: int, 2025-05-07T20:33:18.7108136Z scale_ub: Optional[float], 2025-05-07T20:33:18.7108394Z contiguous: bool, 2025-05-07T20:33:18.7108622Z compiled: bool, 2025-05-07T20:33:18.7108836Z ) -> None: 2025-05-07T20:33:18.7109032Z torch.manual_seed(2025) 2025-05-07T20:33:18.7109267Z 2025-05-07T20:33:18.7109530Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7111528Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7113343Z 2025-05-07T20:33:18.7113566Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7113769Z 2025-05-07T20:33:18.7113865Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7114264Z self=, 2025-05-07T20:33:18.7114659Z T=16384, 2025-05-07T20:33:18.7114837Z D=7168, 2025-05-07T20:33:18.7115022Z scale_ub=None, 2025-05-07T20:33:18.7115223Z contiguous=True, 2025-05-07T20:33:18.7115428Z compiled=False, 2025-05-07T20:33:18.7115626Z ) 2025-05-07T20:33:18.7115931Z self = 2025-05-07T20:33:18.7116405Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.7116678Z 2025-05-07T20:33:18.7116848Z @given( 2025-05-07T20:33:18.7117068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7117375Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7117666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7117990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7118303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7118575Z ) 2025-05-07T20:33:18.7118913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7119353Z def test_silu_mul_quant( 2025-05-07T20:33:18.7119581Z self, 2025-05-07T20:33:18.7119770Z T: int, 2025-05-07T20:33:18.7119958Z D: int, 2025-05-07T20:33:18.7120163Z scale_ub: Optional[float], 2025-05-07T20:33:18.7120422Z contiguous: bool, 2025-05-07T20:33:18.7120651Z compiled: bool, 2025-05-07T20:33:18.7120853Z ) -> None: 2025-05-07T20:33:18.7121048Z torch.manual_seed(2025) 2025-05-07T20:33:18.7121276Z 2025-05-07T20:33:18.7121534Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7123523Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7125438Z 2025-05-07T20:33:18.7125547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7125755Z 2025-05-07T20:33:18.7125852Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7126249Z self=, 2025-05-07T20:33:18.7126639Z T=16384, 2025-05-07T20:33:18.7126814Z D=7168, 2025-05-07T20:33:18.7126998Z scale_ub=1200.0, 2025-05-07T20:33:18.7127257Z contiguous=True, 2025-05-07T20:33:18.7127461Z compiled=False, 2025-05-07T20:33:18.7127649Z ) 2025-05-07T20:33:18.7127945Z self = 2025-05-07T20:33:18.7128413Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.7128684Z 2025-05-07T20:33:18.7128754Z @given( 2025-05-07T20:33:18.7128962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7129255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7129540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7129855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7130160Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7130421Z ) 2025-05-07T20:33:18.7130753Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7131191Z def test_silu_mul_quant( 2025-05-07T20:33:18.7131415Z self, 2025-05-07T20:33:18.7131603Z T: int, 2025-05-07T20:33:18.7131875Z D: int, 2025-05-07T20:33:18.7132080Z scale_ub: Optional[float], 2025-05-07T20:33:18.7132341Z contiguous: bool, 2025-05-07T20:33:18.7132568Z compiled: bool, 2025-05-07T20:33:18.7132772Z ) -> None: 2025-05-07T20:33:18.7132977Z torch.manual_seed(2025) 2025-05-07T20:33:18.7133207Z 2025-05-07T20:33:18.7133457Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7135465Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
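Annotation: every OOM message here ends with the allocator's own hint, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it, assuming it can still take effect: the variable is read when the CUDA caching allocator initializes, so exporting it in the job environment before Python starts is the reliable route, and an in-process assignment only works before the first CUDA allocation.

    # Hedged sketch: prefer `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
    # in the CI job environment. In-process it must run before any CUDA tensor exists.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch
    x = torch.zeros(1, device="cuda")  # first allocation; the allocator config is now fixed

This only mitigates fragmentation (the "19.12 MiB is reserved by PyTorch but unallocated" portion above); it cannot help once 21.73 GiB of live allocations genuinely fill the 22.07 GiB card.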
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7137439Z 2025-05-07T20:33:18.7137551Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.8985965Z 2025-05-07T20:33:18.8986125Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.8986664Z self=, 2025-05-07T20:33:18.8987057Z T=128, 2025-05-07T20:33:18.8987302Z D=5120, 2025-05-07T20:33:18.8987620Z scale_ub=1200.0, 2025-05-07T20:33:18.8987944Z contiguous=False, 2025-05-07T20:33:18.8988244Z compiled=False, 2025-05-07T20:33:18.8988461Z ) 2025-05-07T20:33:18.8988776Z self = 2025-05-07T20:33:18.8989261Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:18.8989535Z 2025-05-07T20:33:18.8989612Z @given( 2025-05-07T20:33:18.8989839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.8990147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.8990445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.8990766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.8991090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.8991364Z ) 2025-05-07T20:33:18.8991706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.8992140Z def test_silu_mul_quant( 2025-05-07T20:33:18.8992374Z self, 2025-05-07T20:33:18.8992560Z T: int, 2025-05-07T20:33:18.8992748Z D: int, 2025-05-07T20:33:18.8992954Z scale_ub: Optional[float], 2025-05-07T20:33:18.8993213Z contiguous: bool, 2025-05-07T20:33:18.8993439Z compiled: bool, 2025-05-07T20:33:18.8993654Z ) -> None: 2025-05-07T20:33:18.8993851Z torch.manual_seed(2025) 2025-05-07T20:33:18.8994081Z 2025-05-07T20:33:18.8994454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.8994785Z 2025-05-07T20:33:18.8994969Z x_sign = torch.sign(x) 2025-05-07T20:33:18.8995247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.8995540Z x = x_sign * x_clamp 2025-05-07T20:33:18.8995775Z x0 = x[:, :D] 2025-05-07T20:33:18.8996019Z x1 = x[:, D:] 2025-05-07T20:33:18.8996231Z 2025-05-07T20:33:18.8996409Z if contiguous: 2025-05-07T20:33:18.8996631Z x0 = x0.contiguous() 2025-05-07T20:33:18.8996881Z x1 = x1.contiguous() 2025-05-07T20:33:18.8997113Z 2025-05-07T20:33:18.8997299Z if scale_ub is not None: 2025-05-07T20:33:18.8997559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.8997882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.8998186Z ) 2025-05-07T20:33:18.8998372Z else: 2025-05-07T20:33:18.8998567Z scale_ub_tensor = None 2025-05-07T20:33:18.8998809Z 2025-05-07T20:33:18.8999099Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.8999451Z op = silu_mul_quant 2025-05-07T20:33:18.8999691Z if compiled: 2025-05-07T20:33:18.8999931Z op = torch.compile(op) 2025-05-07T20:33:18.9000210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9000478Z 2025-05-07T20:33:18.9000658Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.9000817Z 2025-05-07T20:33:18.9000911Z moe/activation_test.py:117: 2025-05-07T20:33:18.9001192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9001506Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.9001772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9002444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.9003185Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.9003754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.9004427Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.9005082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.9005598Z kernel = self.compile( 2025-05-07T20:33:18.9006140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.9006781Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.9007171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9007397Z 2025-05-07T20:33:18.9007599Z self = 2025-05-07T20:33:18.9008675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.9010022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bbf11c0>} 2025-05-07T20:33:18.9011326Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.9012320Z context = 2025-05-07T20:33:18.9012599Z 2025-05-07T20:33:18.9012760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.9013269Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.9013780Z module_map=module_map) 2025-05-07T20:33:18.9014139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.9014484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.9014731Z E ^ 2025-05-07T20:33:18.9015183Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.9015624Z 2025-05-07T20:33:18.9016038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.9016541Z 2025-05-07T20:33:18.9016639Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9017047Z self=, 2025-05-07T20:33:18.9017433Z T=2048, 2025-05-07T20:33:18.9017609Z D=7168, 2025-05-07T20:33:18.9017798Z scale_ub=None, 2025-05-07T20:33:18.9018004Z contiguous=False, 2025-05-07T20:33:18.9018225Z compiled=False, 2025-05-07T20:33:18.9018430Z ) 2025-05-07T20:33:18.9018754Z self = 2025-05-07T20:33:18.9019313Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.9027210Z 2025-05-07T20:33:18.9027305Z @given( 2025-05-07T20:33:18.9027603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9027903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9028202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9028523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9028838Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9029121Z ) 2025-05-07T20:33:18.9029464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9029976Z def test_silu_mul_quant( 2025-05-07T20:33:18.9030205Z self, 2025-05-07T20:33:18.9030396Z T: int, 2025-05-07T20:33:18.9030588Z D: int, 2025-05-07T20:33:18.9030798Z scale_ub: Optional[float], 2025-05-07T20:33:18.9031066Z contiguous: bool, 2025-05-07T20:33:18.9031297Z compiled: bool, 2025-05-07T20:33:18.9031506Z ) -> None: 2025-05-07T20:33:18.9031714Z torch.manual_seed(2025) 2025-05-07T20:33:18.9031951Z 2025-05-07T20:33:18.9032212Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9034226Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
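Annotation: the CompilationError above is a hardware mismatch, not a memory issue. Triton's fp8e4nv is the FP8 E4M3 format, and on the NVIDIA backend its codegen is only available on newer architectures; the (8, 9) compute-capability cutoff below is an assumption (Ada/Hopper class) inferred from the error text, which says this GPU only offers fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip such cases instead of erroring; supports_fp8_e4m3 is an illustrative helper, not an fbgemm_gpu API:

    # Hedged sketch: skip FP8 E4M3 cases on GPUs where Triton cannot compile fp8e4nv.
    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # assumed threshold

    @unittest.skipIf(not supports_fp8_e4m3(), "Triton fp8e4nv unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With a gate like this the run would report skips on this runner rather than four distinct Hypothesis failures mixing OOM and compile errors.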
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.9036183Z 2025-05-07T20:33:18.9036299Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.9036514Z 2025-05-07T20:33:18.9036613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9037017Z self=, 2025-05-07T20:33:18.9037414Z T=128, 2025-05-07T20:33:18.9037596Z D=7168, 2025-05-07T20:33:18.9037780Z scale_ub=1200.0, 2025-05-07T20:33:18.9037987Z contiguous=True, 2025-05-07T20:33:18.9038198Z compiled=True, 2025-05-07T20:33:18.9038394Z ) 2025-05-07T20:33:18.9038693Z self = 2025-05-07T20:33:18.9039164Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.9039432Z 2025-05-07T20:33:18.9039507Z @given( 2025-05-07T20:33:18.9039729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9040032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9040659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9040994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9041306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9041581Z ) 2025-05-07T20:33:18.9041916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9042353Z def test_silu_mul_quant( 2025-05-07T20:33:18.9042577Z self, 2025-05-07T20:33:18.9042768Z T: int, 2025-05-07T20:33:18.9042953Z D: int, 2025-05-07T20:33:18.9043161Z scale_ub: Optional[float], 2025-05-07T20:33:18.9043425Z contiguous: bool, 2025-05-07T20:33:18.9043659Z compiled: bool, 2025-05-07T20:33:18.9043880Z ) -> None: 2025-05-07T20:33:18.9044090Z torch.manual_seed(2025) 2025-05-07T20:33:18.9044326Z 2025-05-07T20:33:18.9044585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9044920Z 2025-05-07T20:33:18.9045108Z x_sign = torch.sign(x) 2025-05-07T20:33:18.9045389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.9045813Z x = x_sign * x_clamp 2025-05-07T20:33:18.9046048Z x0 = x[:, :D] 2025-05-07T20:33:18.9046249Z x1 = x[:, D:] 2025-05-07T20:33:18.9046447Z 2025-05-07T20:33:18.9046621Z if contiguous: 2025-05-07T20:33:18.9046839Z x0 = x0.contiguous() 2025-05-07T20:33:18.9047084Z x1 = x1.contiguous() 2025-05-07T20:33:18.9047315Z 2025-05-07T20:33:18.9047492Z if scale_ub is not None: 2025-05-07T20:33:18.9047758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.9048079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.9048371Z ) 2025-05-07T20:33:18.9048558Z else: 2025-05-07T20:33:18.9048828Z scale_ub_tensor = None 2025-05-07T20:33:18.9049064Z 2025-05-07T20:33:18.9049282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.9049586Z op = silu_mul_quant 2025-05-07T20:33:18.9049837Z if compiled: 2025-05-07T20:33:18.9050069Z op = torch.compile(op) 2025-05-07T20:33:18.9050355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9050619Z 2025-05-07T20:33:18.9050800Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.9050961Z 2025-05-07T20:33:18.9051058Z moe/activation_test.py:117: 2025-05-07T20:33:18.9051339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9051652Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.9051922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9052492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.9053045Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.9053695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.9054368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.9054895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.9055553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.9056259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.9056785Z kernel = self.compile( 2025-05-07T20:33:18.9057334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.9057969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.9058358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9058585Z 2025-05-07T20:33:18.9058793Z self = 2025-05-07T20:33:18.9059898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.9061239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359b85fb00>} 2025-05-07T20:33:18.9062553Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.9063554Z context = 2025-05-07T20:33:18.9063835Z 2025-05-07T20:33:18.9064003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.9064514Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.9065019Z module_map=module_map) 2025-05-07T20:33:18.9065414Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.9065767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.9066010Z E ^ 2025-05-07T20:33:18.9066600Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.9067044Z 2025-05-07T20:33:18.9067544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.1876882Z 2025-05-07T20:33:19.1877157Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1877576Z self=, 2025-05-07T20:33:19.1878094Z T=128, 2025-05-07T20:33:19.1878293Z D=7168, 2025-05-07T20:33:19.1878483Z scale_ub=1200.0, 2025-05-07T20:33:19.1878703Z contiguous=True, 2025-05-07T20:33:19.1878950Z compiled=False, 2025-05-07T20:33:19.1879159Z ) 2025-05-07T20:33:19.1879480Z self = 2025-05-07T20:33:19.1879970Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.1880236Z 2025-05-07T20:33:19.1880321Z @given( 2025-05-07T20:33:19.1880547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1880860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1881165Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1881490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1881806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1882083Z ) 2025-05-07T20:33:19.1882425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1882856Z def test_silu_mul_quant( 2025-05-07T20:33:19.1883092Z self, 2025-05-07T20:33:19.1883289Z T: int, 2025-05-07T20:33:19.1883474Z D: int, 2025-05-07T20:33:19.1883692Z scale_ub: Optional[float], 2025-05-07T20:33:19.1883957Z contiguous: bool, 2025-05-07T20:33:19.1884189Z compiled: bool, 2025-05-07T20:33:19.1884412Z ) -> None: 2025-05-07T20:33:19.1884619Z torch.manual_seed(2025) 2025-05-07T20:33:19.1884849Z 2025-05-07T20:33:19.1885114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1885444Z 2025-05-07T20:33:19.1885632Z x_sign = torch.sign(x) 2025-05-07T20:33:19.1885915Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.1887950Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1889875Z 2025-05-07T20:33:19.1889990Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.1890200Z 2025-05-07T20:33:19.1890307Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1890701Z self=, 2025-05-07T20:33:19.1891104Z T=128, 2025-05-07T20:33:19.1891285Z D=5120, 2025-05-07T20:33:19.1891470Z scale_ub=1200.0, 2025-05-07T20:33:19.1891683Z contiguous=True, 2025-05-07T20:33:19.1891895Z compiled=True, 2025-05-07T20:33:19.1892089Z ) 2025-05-07T20:33:19.1892396Z self = 2025-05-07T20:33:19.1892876Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.1893143Z 2025-05-07T20:33:19.1893232Z @given( 2025-05-07T20:33:19.1893573Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1893880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1894186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1894506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1894831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1895110Z ) 2025-05-07T20:33:19.1895457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1895900Z def test_silu_mul_quant( 2025-05-07T20:33:19.1896138Z self, 2025-05-07T20:33:19.1896331Z T: int, 2025-05-07T20:33:19.1896514Z D: int, 2025-05-07T20:33:19.1896724Z scale_ub: Optional[float], 2025-05-07T20:33:19.1897046Z contiguous: bool, 2025-05-07T20:33:19.1897273Z compiled: bool, 2025-05-07T20:33:19.1897489Z ) -> None: 2025-05-07T20:33:19.1897701Z torch.manual_seed(2025) 2025-05-07T20:33:19.1897944Z 2025-05-07T20:33:19.1898202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1898535Z 2025-05-07T20:33:19.1898720Z x_sign = torch.sign(x) 2025-05-07T20:33:19.1899002Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.1900949Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
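Annotation: this example and the one just before it fail a line later than the others, at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0). torch.abs(x) materializes a full copy of x and torch.clamp allocates another, so the line transiently needs two extra x-sized buffers on a card with only 4.44 MiB free. A hedged sketch (not the test's actual code) of trimming those temporaries with in-place ops while keeping the same values:

    import torch
    # Shapes from this example: T=128, D=5120
    x = torch.randn(128, 2 * 5120, device="cuda", dtype=torch.bfloat16)
    x_sign = torch.sign(x)
    x_clamp = x.abs().clamp_(0.01, 2.0)  # clamp_ reuses the buffer abs() just allocated
    x = x_sign.mul_(x_clamp)             # in-place multiply; the old x is freed by refcount

Peak extra memory drops from two x-sized temporaries to one, which matters only at the margin here but is the usual first step when a test dies on elementwise intermediates.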
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1902793Z 2025-05-07T20:33:19.1902908Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.1903119Z 2025-05-07T20:33:19.1903218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1903633Z self=, 2025-05-07T20:33:19.1904030Z T=128, 2025-05-07T20:33:19.1904227Z D=7168, 2025-05-07T20:33:19.1904498Z scale_ub=None, 2025-05-07T20:33:19.1904782Z contiguous=True, 2025-05-07T20:33:19.1905100Z compiled=True, 2025-05-07T20:33:19.1905399Z ) 2025-05-07T20:33:19.1905882Z self = 2025-05-07T20:33:19.1906438Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.1906809Z 2025-05-07T20:33:19.1906892Z @given( 2025-05-07T20:33:19.1907190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1907701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1908133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1908670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1909149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1909568Z ) 2025-05-07T20:33:19.1910064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1910657Z def test_silu_mul_quant( 2025-05-07T20:33:19.1910984Z self, 2025-05-07T20:33:19.1911251Z T: int, 2025-05-07T20:33:19.1911516Z D: int, 2025-05-07T20:33:19.1911800Z scale_ub: Optional[float], 2025-05-07T20:33:19.1912158Z contiguous: bool, 2025-05-07T20:33:19.1912476Z compiled: bool, 2025-05-07T20:33:19.1912769Z ) -> None: 2025-05-07T20:33:19.1913054Z torch.manual_seed(2025) 2025-05-07T20:33:19.1913371Z 2025-05-07T20:33:19.1913720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1916585Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
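Annotation: note how the free-memory figure shrinks across the run: the first failures report 26.44 MiB free, these later ones only 4.44 MiB, so allocations are accumulating across Hypothesis examples inside the single test invocation. A minimal cleanup helper, as an illustrative mitigation only; Hypothesis re-enters the test body for every example, so calling this at the top of the body is the simplest hook:

    # Hedged sketch: drop dead references, then hand cached blocks back to the driver.
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # free Python-side references to old example tensors
        torch.cuda.empty_cache()   # return the caching allocator's unused blocks
        torch.cuda.synchronize()   # make sure pending frees have completed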
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1919190Z 2025-05-07T20:33:19.1919355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.1919648Z 2025-05-07T20:33:19.1920150Z FAILED 2025-05-07T20:33:19.1920300Z 2025-05-07T20:33:19.1920475Z =================================== FAILURES =================================== 2025-05-07T20:33:19.1921044Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:19.1921678Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:19.1922508Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:19.1923229Z | yield 2025-05-07T20:33:19.1923842Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:19.1924542Z | self._callTestMethod(testMethod) 2025-05-07T20:33:19.1924938Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:19.1925651Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:19.1926394Z | if method() is not None: 2025-05-07T20:33:19.1926755Z | ~~~~~~^^ 2025-05-07T20:33:19.1927604Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:19.1928602Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1928985Z | ^^^^^^^ 2025-05-07T20:33:19.1929753Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:19.1930593Z | raise the_error_hypothesis_found 2025-05-07T20:33:19.1931165Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:19.1931727Z +-+---------------- 1 ---------------- 2025-05-07T20:33:19.1932122Z | Traceback (most recent call last): 2025-05-07T20:33:19.1933089Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:19.1934140Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1937033Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
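Annotation: the "+ Exception Group Traceback" framing that opens the FAILURES section is PEP 654 output. Hypothesis 6.x collects the distinct falsifying examples (four here) into one ExceptionGroup, and the Python 3.13 traceback machinery renders the sub-exceptions with the +-+ prefixes seen above and below. A minimal sketch of handling such a group programmatically with except* (Python 3.11+); run_suite is a hypothetical stand-in for whatever raises the group:

    # Hedged sketch of PEP 654 handling; not part of the test suite itself.
    import torch

    def run_suite() -> None:
        raise ExceptionGroup("demo", [torch.OutOfMemoryError("oom"), ValueError("fp8")])

    try:
        run_suite()
    except* torch.OutOfMemoryError as eg:
        print(f"{len(eg.exceptions)} OOM failure(s)")        # mirrors sub-exceptions 1-3
    except* ValueError as eg:
        print(f"{len(eg.exceptions)} compile-side failure(s)")  # mirrors sub-exception 4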
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1939692Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.1940540Z | self=, 2025-05-07T20:33:19.1941081Z | T=2048, 2025-05-07T20:33:19.1941402Z | D=5120, # or any other generated value 2025-05-07T20:33:19.1941859Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:19.1942333Z | contiguous=True, # or any other generated value 2025-05-07T20:33:19.1942778Z | compiled=False, # or any other generated value 2025-05-07T20:33:19.1943092Z | ) 2025-05-07T20:33:19.1943269Z | 2025-05-07T20:33:19.1943893Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:19.1944544Z +---------------- 2 ---------------- 2025-05-07T20:33:19.1944836Z | Traceback (most recent call last): 2025-05-07T20:33:19.1945535Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:19.1946295Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1948373Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1950380Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.1950815Z | self=, 2025-05-07T20:33:19.1951212Z | T=128, 2025-05-07T20:33:19.1951415Z | D=7168, 2025-05-07T20:33:19.1951624Z | scale_ub=None, 2025-05-07T20:33:19.1951859Z | contiguous=True, 2025-05-07T20:33:19.1952102Z | compiled=True, 2025-05-07T20:33:19.1952329Z | ) 2025-05-07T20:33:19.1952505Z | 2025-05-07T20:33:19.1953022Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:19.1953622Z +---------------- 3 ---------------- 2025-05-07T20:33:19.1953907Z | Traceback (most recent call last): 2025-05-07T20:33:19.1954616Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:19.1955390Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1957843Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
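Annotation: each falsifying example above comes with a replay recipe, e.g. @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') for failure 1. A sketch of where that decorator goes, mirroring the @given/@settings stack from the log; the blob only decodes against this exact strategy signature and Hypothesis version, the max_examples value stands in for _MAX_SAMPLES, and the body is a stub for the real one in moe/activation_test.py:

    import unittest
    from typing import Optional
    from hypothesis import Verbosity, given, reproduce_failure, settings
    import hypothesis.strategies as st

    class ReplayActivationTests(unittest.TestCase):
        @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob copied from failure 1
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
        def test_silu_mul_quant(
            self, T: int, D: int, scale_ub: Optional[float],
            contiguous: bool, compiled: bool,
        ) -> None:
            ...  # stub; with a passing body Hypothesis reports the example no longer fails

The decorator is meant to be temporary: once the underlying bug is fixed, Hypothesis complains that the example did not reproduce, reminding you to delete it.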
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1960500Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.1961103Z | self=, 2025-05-07T20:33:19.1961795Z | T=128, 2025-05-07T20:33:19.1962088Z | D=5120, 2025-05-07T20:33:19.1962386Z | scale_ub=1200.0, 2025-05-07T20:33:19.1962725Z | contiguous=True, 2025-05-07T20:33:19.1963061Z | compiled=True, 2025-05-07T20:33:19.1963368Z | ) 2025-05-07T20:33:19.1963618Z | 2025-05-07T20:33:19.1969630Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:19.1970512Z +---------------- 4 ---------------- 2025-05-07T20:33:19.1970907Z | Traceback (most recent call last): 2025-05-07T20:33:19.1971899Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:19.1972883Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.1973268Z | ~~~~~~^^ 2025-05-07T20:33:19.1974250Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:19.1975299Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.1976457Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:19.1977530Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.1977922Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:19.1978283Z | a, 2025-05-07T20:33:19.1978548Z | ^^ 2025-05-07T20:33:19.1978835Z | ...<23 lines>... 
2025-05-07T20:33:19.1979166Z | USE_INT64=use_int64, 2025-05-07T20:33:19.1979518Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1979925Z | ) 2025-05-07T20:33:19.1980178Z | ^ 2025-05-07T20:33:19.1980906Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:19.1981931Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.1982558Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1983426Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:19.1984498Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.1985138Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1986023Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:19.1986977Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.1987612Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1988450Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:19.1989224Z | fn() 2025-05-07T20:33:19.1989490Z | ~~^^ 2025-05-07T20:33:19.1990259Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:19.1991140Z | self.fn.run( 2025-05-07T20:33:19.1991441Z | ~~~~~~~~~~~^ 2025-05-07T20:33:19.1991725Z | *args, 2025-05-07T20:33:19.1992024Z | ^^^^^^ 2025-05-07T20:33:19.1992316Z | **current, 2025-05-07T20:33:19.1992616Z | ^^^^^^^^^^ 2025-05-07T20:33:19.1992914Z | ) 2025-05-07T20:33:19.1993170Z | ^ 2025-05-07T20:33:19.1993837Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:19.1994685Z | kernel = self.compile( 2025-05-07T20:33:19.1995041Z | src, 2025-05-07T20:33:19.1995322Z | target=target, 2025-05-07T20:33:19.2016251Z | options=options.__dict__, 2025-05-07T20:33:19.2016633Z | ) 2025-05-07T20:33:19.2017383Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:19.2018342Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2019290Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:19.2020330Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2020957Z | module_map=module_map) 2025-05-07T20:33:19.2021435Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2021891Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.2022245Z | ^ 2025-05-07T20:33:19.2023050Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2023809Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.2024334Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:19.2025022Z | self=, 2025-05-07T20:33:19.2025591Z | T=1, # or any other generated value 2025-05-07T20:33:19.2026022Z | D=5120, # or any other generated value 2025-05-07T20:33:19.2026502Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:19.2026984Z | contiguous=True, # or any other generated value 2025-05-07T20:33:19.2027664Z | compiled=True, # or any other generated value 2025-05-07T20:33:19.2028055Z | ) 2025-05-07T20:33:19.2028293Z | 2025-05-07T20:33:19.2029009Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:19.2029823Z +------------------------------------ 2025-05-07T20:33:19.2030300Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:19.2030804Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2031336Z self=, 2025-05-07T20:33:19.2031867Z T=1, 2025-05-07T20:33:19.2032109Z D=5120, 2025-05-07T20:33:19.2032360Z scale_ub=None, 2025-05-07T20:33:19.2032646Z contiguous=True, 2025-05-07T20:33:19.2032946Z compiled=True, 2025-05-07T20:33:19.2033225Z ) 2025-05-07T20:33:19.2033641Z self = 2025-05-07T20:33:19.2034283Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.2034636Z 2025-05-07T20:33:19.2034752Z @given( 2025-05-07T20:33:19.2035054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2035472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2035873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2036313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2036752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2037132Z ) 2025-05-07T20:33:19.2037593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2038177Z def test_silu_mul_quant( 2025-05-07T20:33:19.2038489Z self, 2025-05-07T20:33:19.2038740Z T: int, 2025-05-07T20:33:19.2038986Z D: int, 2025-05-07T20:33:19.2039261Z scale_ub: Optional[float], 2025-05-07T20:33:19.2039612Z contiguous: bool, 2025-05-07T20:33:19.2039916Z compiled: bool, 2025-05-07T20:33:19.2040526Z ) -> None: 2025-05-07T20:33:19.2040817Z torch.manual_seed(2025) 2025-05-07T20:33:19.2041304Z 2025-05-07T20:33:19.2041675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2042142Z 2025-05-07T20:33:19.2042393Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2042774Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2043207Z x = x_sign * x_clamp 2025-05-07T20:33:19.2043525Z x0 = x[:, :D] 2025-05-07T20:33:19.2043816Z x1 = x[:, D:] 2025-05-07T20:33:19.2044104Z 2025-05-07T20:33:19.2044354Z if contiguous: 2025-05-07T20:33:19.2044666Z x0 = x0.contiguous() 2025-05-07T20:33:19.2045016Z x1 = x1.contiguous() 2025-05-07T20:33:19.2045331Z 2025-05-07T20:33:19.2045568Z if scale_ub is not None: 2025-05-07T20:33:19.2045932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2046387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2046787Z ) 2025-05-07T20:33:19.2047059Z else: 2025-05-07T20:33:19.2047352Z scale_ub_tensor = None 2025-05-07T20:33:19.2047777Z 2025-05-07T20:33:19.2048160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2048574Z op = silu_mul_quant 2025-05-07T20:33:19.2048904Z if compiled: 2025-05-07T20:33:19.2049233Z op = torch.compile(op) 2025-05-07T20:33:19.2049627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2049983Z 2025-05-07T20:33:19.2050246Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:19.2050624Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.2051008Z 2025-05-07T20:33:19.2051324Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2051763Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.2052238Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.2052646Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.2053126Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2053540Z 2025-05-07T20:33:19.2053793Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.2054056Z 2025-05-07T20:33:19.2054182Z moe/activation_test.py:126: 2025-05-07T20:33:19.2054569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2054989Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.2055408Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2056485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.2057479Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.2058176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2059062Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2059962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.2060905Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.2061851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.2062693Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.2063523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.2064203Z fn() 2025-05-07T20:33:19.2064883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.2065660Z self.fn.run( 2025-05-07T20:33:19.2066283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2067044Z kernel = self.compile( 2025-05-07T20:33:19.2067862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2068708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2069210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2069511Z 2025-05-07T20:33:19.2069771Z self = 2025-05-07T20:33:19.2071174Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2072998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f38ffeae700>} 2025-05-07T20:33:19.2074808Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2076160Z context = 2025-05-07T20:33:19.2076537Z 2025-05-07T20:33:19.2076752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2077431Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2078036Z module_map=module_map) 2025-05-07T20:33:19.2078489Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2078945Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.2079293Z E ^ 2025-05-07T20:33:19.2079942Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2080537Z 2025-05-07T20:33:19.2081085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2081768Z 2025-05-07T20:33:19.2081900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2082434Z self=, 2025-05-07T20:33:19.2082941Z T=2048, 2025-05-07T20:33:19.2083181Z D=5120, 2025-05-07T20:33:19.2083431Z scale_ub=1200.0, 2025-05-07T20:33:19.2083735Z contiguous=True, 2025-05-07T20:33:19.2084048Z compiled=False, 2025-05-07T20:33:19.2084317Z ) 2025-05-07T20:33:19.2084739Z self = 2025-05-07T20:33:19.2085391Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2085762Z 2025-05-07T20:33:19.2085865Z @given( 2025-05-07T20:33:19.2086152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2086543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2086951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2087411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2087830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2088202Z ) 2025-05-07T20:33:19.2088657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2089227Z def test_silu_mul_quant( 2025-05-07T20:33:19.2089543Z self, 2025-05-07T20:33:19.2089813Z T: int, 2025-05-07T20:33:19.2090082Z D: int, 2025-05-07T20:33:19.2090378Z scale_ub: Optional[float], 2025-05-07T20:33:19.2090726Z contiguous: bool, 2025-05-07T20:33:19.2091045Z compiled: bool, 2025-05-07T20:33:19.2091331Z ) -> None: 2025-05-07T20:33:19.2091615Z torch.manual_seed(2025) 2025-05-07T20:33:19.2091946Z 2025-05-07T20:33:19.2092287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2092749Z 2025-05-07T20:33:19.2093073Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2093461Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2093886Z x = x_sign * x_clamp 2025-05-07T20:33:19.2094218Z x0 = x[:, :D] 2025-05-07T20:33:19.2094509Z x1 = x[:, D:] 2025-05-07T20:33:19.2094799Z 2025-05-07T20:33:19.2095057Z if contiguous: 2025-05-07T20:33:19.2095367Z x0 = x0.contiguous() 2025-05-07T20:33:19.2095725Z x1 = x1.contiguous() 2025-05-07T20:33:19.2096072Z 2025-05-07T20:33:19.2096343Z if scale_ub is not None: 2025-05-07T20:33:19.2096698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2097143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2097554Z ) 2025-05-07T20:33:19.2097818Z else: 2025-05-07T20:33:19.2098088Z scale_ub_tensor = None 2025-05-07T20:33:19.2098409Z 2025-05-07T20:33:19.2098715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2099145Z op = silu_mul_quant 2025-05-07T20:33:19.2099591Z if compiled: 
2025-05-07T20:33:19.2099913Z op = torch.compile(op) 2025-05-07T20:33:19.2100318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2100676Z 2025-05-07T20:33:19.2100924Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2101154Z 2025-05-07T20:33:19.2101288Z moe/activation_test.py:117: 2025-05-07T20:33:19.2101697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2102130Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2102503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2103444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2104489Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2105242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2106149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2107009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2107797Z kernel = self.compile( 2025-05-07T20:33:19.2108534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2109442Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2109961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2110259Z 2025-05-07T20:33:19.2110526Z self = 2025-05-07T20:33:19.2111997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2113865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38ffd5e020>} 2025-05-07T20:33:19.2115663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2117044Z context = 2025-05-07T20:33:19.2117421Z 2025-05-07T20:33:19.2117650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2118378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2119033Z module_map=module_map) 2025-05-07T20:33:19.2119597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2120086Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2120433Z E ^ 2025-05-07T20:33:19.2121044Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2121689Z 2025-05-07T20:33:19.2122273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2122997Z 2025-05-07T20:33:19.2123134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2123677Z self=, 2025-05-07T20:33:19.2124204Z T=2048, 2025-05-07T20:33:19.2124440Z D=5120, 2025-05-07T20:33:19.2124687Z scale_ub=1200.0, 2025-05-07T20:33:19.2124975Z contiguous=True, 2025-05-07T20:33:19.2125249Z compiled=True, 2025-05-07T20:33:19.2125511Z ) 2025-05-07T20:33:19.2125948Z self = 2025-05-07T20:33:19.2126660Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2127062Z 2025-05-07T20:33:19.2127163Z @given( 2025-05-07T20:33:19.2127455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2127856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2128272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2128715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2129166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2129538Z ) 2025-05-07T20:33:19.2129997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2130597Z def test_silu_mul_quant( 2025-05-07T20:33:19.2130923Z self, 2025-05-07T20:33:19.2131235Z T: int, 2025-05-07T20:33:19.2131500Z D: int, 2025-05-07T20:33:19.2131776Z scale_ub: Optional[float], 2025-05-07T20:33:19.2132125Z contiguous: bool, 2025-05-07T20:33:19.2132429Z compiled: bool, 2025-05-07T20:33:19.2132712Z ) -> None: 2025-05-07T20:33:19.2132988Z torch.manual_seed(2025) 2025-05-07T20:33:19.2133299Z 2025-05-07T20:33:19.2133636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2134070Z 2025-05-07T20:33:19.2134314Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2134674Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2135069Z x = x_sign * x_clamp 2025-05-07T20:33:19.2135378Z x0 = x[:, :D] 2025-05-07T20:33:19.2135646Z x1 = x[:, D:] 2025-05-07T20:33:19.2135912Z 2025-05-07T20:33:19.2136150Z if contiguous: 2025-05-07T20:33:19.2136446Z x0 = x0.contiguous() 2025-05-07T20:33:19.2136770Z x1 = x1.contiguous() 2025-05-07T20:33:19.2137082Z 2025-05-07T20:33:19.2137332Z if scale_ub is not None: 2025-05-07T20:33:19.2137683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2138120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2138525Z ) 2025-05-07T20:33:19.2138775Z else: 2025-05-07T20:33:19.2139048Z scale_ub_tensor = None 2025-05-07T20:33:19.2139377Z 2025-05-07T20:33:19.2139666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2140349Z op = silu_mul_quant 2025-05-07T20:33:19.2140686Z if compiled: 2025-05-07T20:33:19.2140998Z op = torch.compile(op) 2025-05-07T20:33:19.2141392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2141759Z 2025-05-07T20:33:19.2142005Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.2142380Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.2142763Z 2025-05-07T20:33:19.2143073Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2143503Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.2144038Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.2144463Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.2144927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2145337Z 2025-05-07T20:33:19.2145603Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.2145867Z 2025-05-07T20:33:19.2145995Z moe/activation_test.py:126: 2025-05-07T20:33:19.2146386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2146830Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.2147253Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2148381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.2149395Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.2150112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2151177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2152109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.2153079Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.2154078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.2154928Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.2155740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.2156521Z fn() 2025-05-07T20:33:19.2157205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.2157975Z self.fn.run( 2025-05-07T20:33:19.2158605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2159324Z kernel = self.compile( 2025-05-07T20:33:19.2160042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2160932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2161488Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2161796Z 2025-05-07T20:33:19.2162082Z self = 2025-05-07T20:33:19.2163517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2165383Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fec3e200>} 2025-05-07T20:33:19.2167264Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2168646Z context = 2025-05-07T20:33:19.2169042Z 2025-05-07T20:33:19.2169273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2169973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2170620Z module_map=module_map) 2025-05-07T20:33:19.2171120Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2171593Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.2171949Z E ^ 2025-05-07T20:33:19.2172630Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2173272Z 2025-05-07T20:33:19.2173849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2174540Z 2025-05-07T20:33:19.2174680Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2175229Z self=, 2025-05-07T20:33:19.2175772Z T=16384, 2025-05-07T20:33:19.2176024Z D=7168, 2025-05-07T20:33:19.2176286Z scale_ub=1200.0, 2025-05-07T20:33:19.2176594Z contiguous=False, 2025-05-07T20:33:19.2176893Z compiled=False, 2025-05-07T20:33:19.2177185Z ) 2025-05-07T20:33:19.2177614Z self = 2025-05-07T20:33:19.2178297Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2178674Z 2025-05-07T20:33:19.2178780Z @given( 2025-05-07T20:33:19.2179195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2179621Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2180017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2180480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2180926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2181296Z ) 2025-05-07T20:33:19.2181756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2182332Z def test_silu_mul_quant( 2025-05-07T20:33:19.2182639Z self, 2025-05-07T20:33:19.2182902Z T: int, 2025-05-07T20:33:19.2183172Z D: int, 2025-05-07T20:33:19.2183469Z scale_ub: Optional[float], 2025-05-07T20:33:19.2183891Z contiguous: bool, 2025-05-07T20:33:19.2184211Z compiled: bool, 2025-05-07T20:33:19.2184499Z ) -> None: 2025-05-07T20:33:19.2184778Z torch.manual_seed(2025) 2025-05-07T20:33:19.2185103Z 2025-05-07T20:33:19.2185461Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2185916Z 2025-05-07T20:33:19.2186206Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2186586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2186983Z x = x_sign * x_clamp 2025-05-07T20:33:19.2187293Z x0 = x[:, :D] 2025-05-07T20:33:19.2187641Z x1 = x[:, D:] 2025-05-07T20:33:19.2187907Z 2025-05-07T20:33:19.2188154Z if contiguous: 2025-05-07T20:33:19.2188459Z x0 = x0.contiguous() 2025-05-07T20:33:19.2188794Z x1 = x1.contiguous() 2025-05-07T20:33:19.2189117Z 2025-05-07T20:33:19.2189382Z if scale_ub is not None: 2025-05-07T20:33:19.2189759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2190212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2190635Z ) 2025-05-07T20:33:19.2190901Z else: 2025-05-07T20:33:19.2191184Z scale_ub_tensor = None 2025-05-07T20:33:19.2191524Z 2025-05-07T20:33:19.2191830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2192255Z op = silu_mul_quant 2025-05-07T20:33:19.2192594Z if compiled: 2025-05-07T20:33:19.2192923Z op = torch.compile(op) 2025-05-07T20:33:19.2193277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2193544Z 2025-05-07T20:33:19.2193730Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2193892Z 2025-05-07T20:33:19.2193990Z moe/activation_test.py:117: 2025-05-07T20:33:19.2194281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2194603Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2194879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2195624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2196357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2196910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2197576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2198227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2198750Z kernel = self.compile( 2025-05-07T20:33:19.2199293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2199928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2200062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2200070Z 2025-05-07T20:33:19.2200275Z self = 2025-05-07T20:33:19.2201110Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2201669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fee484a0>} 2025-05-07T20:33:19.2202398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2202590Z context = 2025-05-07T20:33:19.2202634Z 2025-05-07T20:33:19.2202796Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2203064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2203174Z module_map=module_map) 2025-05-07T20:33:19.2203334Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2203434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2203509Z E ^ 2025-05-07T20:33:19.2203860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2203872Z 2025-05-07T20:33:19.2204285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2204290Z
[The remaining Hypothesis examples repeat the identical test body and fail with the same CompilationError, so they are condensed below. Every eager example (compiled=False) fails inside fn() when silu_mul_quant (activation.py:80) compiles _fbgemm_silu_mul_quant; every compiled example (compiled=True) passes fn() and then fails in ref_fn(), where triton_quantize_fp8_row (fp8_gemm.py:2370) compiles _kernel_quantize_fp8_row.]
2025-05-07T20:33:19.2204389Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
2025-05-07T20:33:19.2227010Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
2025-05-07T20:33:19.2239679Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
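All of these failures share one root cause: Triton's fp8e4nv is the CUDA float8 e4m3 format, which Triton's NVIDIA backend provides only on GPUs of compute capability 8.9 (Ada) or newer; on older devices kernel compilation raises exactly the ValueError shown above, listing only fp8e4b15 and fp8e5. A minimal sketch of a capability guard follows; the helper name supports_fp8e4nv and the skipIf wiring are illustrative assumptions, not part of the FBGEMM test suite:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Assumption: CUDA device 0 is the device under test. fp8e4nv (e4m3)
    # requires compute capability >= 8.9 in Triton's NVIDIA backend.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(0) >= (8, 9)

# Applied to the test above, an unsupported GPU would skip instead of
# failing at Triton compile time:
#
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...): ...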
2025-05-07T20:33:19.2252718Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
2025-05-07T20:33:19.2268852Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
2025-05-07T20:33:19.2281518Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
2025-05-07T20:33:19.2293971Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
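For reference, the contract the test checks is y ~= y_fp8.to(torch.float32) * y_scale[:, None], i.e. one scale per row. A pure-torch sketch of that row-wise quantization, for illustration only (the real triton_quantize_fp8_row is a Triton kernel; torch.float8_e4m3fn as the target dtype and the scale_ub clamp are assumptions here):

from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs scaling into float8 e4m3 (assumed target format).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=-1).clamp(min=1e-12)  # avoid divide-by-zero
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantization contract checked by the test:
    # y ~= y_fp8.to(torch.float32) * scale[:, None]
    return y_fp8, scale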
2025-05-07T20:33:19.2309837Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
2025-05-07T20:33:19.2325506Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
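The failure is independent of Hypothesis and of silu_mul_quant itself: pushing any CUDA tensor through the same quantization entry point reproduces it. A minimal repro sketch, assuming the same fbgemm_gpu experimental build as in the tracebacks above (the tensor shape is arbitrary):

import torch

from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

y = torch.randn(4, 5120, device="cuda", dtype=torch.float32)
# On a GPU without fp8e4nv support, compiling _kernel_quantize_fp8_row
# raises triton.compiler.errors.CompilationError as in the log above.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)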
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2351860Z 2025-05-07T20:33:19.2352274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Every remaining Hypothesis example fails with this same error, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), surfaced from triton/compiler/compiler.py:100 as a CompilationError. The test body and traceback repeat verbatim for each example, so only the sampled parameters and the failing entry point are listed:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True): fails in ref_fn at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row (fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): fails in ref_fn at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
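The root cause is identical across examples: Triton's fp8e4nv type (the NVIDIA float8_e4m3fn variant) only compiles for GPUs with compute capability 8.9 or newer; on older parts (SM < 8.9, e.g. A100 or A10G) the compiler offers only fp8e4b15 and fp8e5, which is exactly the ValueError above. A minimal sketch of a capability guard such a test could use to skip cleanly on unsupported hardware; the helper name and decorator placement are illustrative, not existing FBGEMM API:

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with
    # compute capability >= 8.9 (Ada / Hopper or newer).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

# Hypothetical usage on the test above (requires `import unittest`):
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs SM >= 8.9")
# def test_silu_mul_quant(...): ...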
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails in fn at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant via torch.compile and silu_mul_quant (gen_ai/moe/activation.py:80)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): fails in ref_fn at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
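Two distinct entry points reach the same compile failure: the fused path dies inside _fbgemm_silu_mul_quant (silu_mul_quant, activation.py:80, surfacing through torch._dynamo when compiled=True), while the eager reference dies inside _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370). The math under test is small: SiLU(x0) * x1, then rowwise fp8 quantization. A plain-PyTorch sketch of that reference computation, assuming E4M3_MAX = 448.0 (the finite max of torch.float8_e4m3fn) and approximating triton_quantize_fp8_row's exact clamping/eps details:

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (fp8e4nv)

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, exactly as ref_fn does above.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Rowwise max-abs scale, optionally clamped to scale_ub.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantize with y_fp8.float() * scale[:, None], as the test does.
    return y_fp8, scale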
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2411648Z 2025-05-07T20:33:19.2412058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2412063Z 2025-05-07T20:33:19.2412161Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2412383Z self=, 2025-05-07T20:33:19.2412459Z T=1, 2025-05-07T20:33:19.2412538Z D=5120, 2025-05-07T20:33:19.2412618Z scale_ub=None, 2025-05-07T20:33:19.2412698Z contiguous=True, 2025-05-07T20:33:19.2412780Z compiled=False, 2025-05-07T20:33:19.2412851Z ) 2025-05-07T20:33:19.2413066Z self = 2025-05-07T20:33:19.2413274Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2413279Z 2025-05-07T20:33:19.2413354Z @given( 2025-05-07T20:33:19.2413471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2413575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2413686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2413804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2413913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2413986Z ) 2025-05-07T20:33:19.2414228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2414318Z def test_silu_mul_quant( 2025-05-07T20:33:19.2414396Z self, 2025-05-07T20:33:19.2414474Z T: int, 2025-05-07T20:33:19.2414549Z D: int, 2025-05-07T20:33:19.2414642Z scale_ub: Optional[float], 2025-05-07T20:33:19.2414734Z contiguous: bool, 2025-05-07T20:33:19.2414817Z compiled: bool, 2025-05-07T20:33:19.2414892Z ) -> None: 2025-05-07T20:33:19.2414987Z torch.manual_seed(2025) 2025-05-07T20:33:19.2415061Z 2025-05-07T20:33:19.2415232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2415305Z 2025-05-07T20:33:19.2415393Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2415518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2415605Z x = x_sign * x_clamp 2025-05-07T20:33:19.2415680Z x0 = x[:, :D] 2025-05-07T20:33:19.2415761Z x1 = x[:, D:] 2025-05-07T20:33:19.2415828Z 2025-05-07T20:33:19.2415908Z if contiguous: 2025-05-07T20:33:19.2416016Z x0 = x0.contiguous() 2025-05-07T20:33:19.2416110Z x1 = x1.contiguous() 2025-05-07T20:33:19.2416192Z 2025-05-07T20:33:19.2416294Z if scale_ub is not None: 2025-05-07T20:33:19.2416396Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2416539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2416613Z ) 2025-05-07T20:33:19.2416688Z else: 2025-05-07T20:33:19.2416832Z scale_ub_tensor = None 2025-05-07T20:33:19.2416904Z 2025-05-07T20:33:19.2417030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2417123Z op = silu_mul_quant 2025-05-07T20:33:19.2417204Z if compiled: 2025-05-07T20:33:19.2417299Z op = torch.compile(op) 2025-05-07T20:33:19.2417407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2417479Z 2025-05-07T20:33:19.2417564Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2417569Z 2025-05-07T20:33:19.2417667Z moe/activation_test.py:117: 2025-05-07T20:33:19.2417793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2417896Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2417993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2418486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2418587Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2419027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2419246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2419589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2419681Z kernel = self.compile( 2025-05-07T20:33:19.2420078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2420248Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2420370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2420414Z 2025-05-07T20:33:19.2420621Z self = 2025-05-07T20:33:19.2421389Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2421885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9226660>} 2025-05-07T20:33:19.2422617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2422808Z context = 2025-05-07T20:33:19.2422812Z 2025-05-07T20:33:19.2422974Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2423233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2423348Z module_map=module_map) 2025-05-07T20:33:19.2423505Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2423599Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2423681Z E ^ 2025-05-07T20:33:19.2424029Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2424034Z 2025-05-07T20:33:19.2424469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2424474Z 2025-05-07T20:33:19.2424572Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2424787Z self=, 2025-05-07T20:33:19.2424871Z T=128, 2025-05-07T20:33:19.2424944Z D=5120, 2025-05-07T20:33:19.2425021Z scale_ub=None, 2025-05-07T20:33:19.2425109Z contiguous=False, 2025-05-07T20:33:19.2425234Z compiled=True, 2025-05-07T20:33:19.2425306Z ) 2025-05-07T20:33:19.2425524Z self = 2025-05-07T20:33:19.2425688Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2425693Z 2025-05-07T20:33:19.2425771Z @given( 2025-05-07T20:33:19.2425881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2425976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2426093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2426206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2426315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2426392Z ) 2025-05-07T20:33:19.2426629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2426726Z def test_silu_mul_quant( 2025-05-07T20:33:19.2426801Z self, 2025-05-07T20:33:19.2426876Z T: int, 2025-05-07T20:33:19.2426955Z D: int, 2025-05-07T20:33:19.2427133Z scale_ub: Optional[float], 2025-05-07T20:33:19.2427221Z contiguous: bool, 2025-05-07T20:33:19.2427306Z compiled: bool, 2025-05-07T20:33:19.2427380Z ) -> None: 2025-05-07T20:33:19.2427523Z torch.manual_seed(2025) 2025-05-07T20:33:19.2427597Z 2025-05-07T20:33:19.2427762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2427837Z 2025-05-07T20:33:19.2427931Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2428049Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2428136Z x = x_sign * x_clamp 2025-05-07T20:33:19.2428213Z x0 = x[:, :D] 2025-05-07T20:33:19.2428290Z x1 = x[:, D:] 2025-05-07T20:33:19.2428442Z 2025-05-07T20:33:19.2428520Z if contiguous: 2025-05-07T20:33:19.2428607Z x0 = x0.contiguous() 2025-05-07T20:33:19.2428694Z x1 = x1.contiguous() 2025-05-07T20:33:19.2428766Z 2025-05-07T20:33:19.2428855Z if scale_ub is not None: 2025-05-07T20:33:19.2428965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2429094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2429167Z ) 2025-05-07T20:33:19.2429241Z else: 2025-05-07T20:33:19.2429332Z scale_ub_tensor = None 2025-05-07T20:33:19.2429401Z 2025-05-07T20:33:19.2429529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2429616Z op = silu_mul_quant 2025-05-07T20:33:19.2429703Z if compiled: 2025-05-07T20:33:19.2429801Z op = torch.compile(op) 2025-05-07T20:33:19.2429903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2429979Z 2025-05-07T20:33:19.2430070Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2430074Z 2025-05-07T20:33:19.2430169Z moe/activation_test.py:117: 2025-05-07T20:33:19.2430305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2430405Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2430500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2430869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2430959Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2431452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2431546Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2431901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2432126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2432466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2432606Z kernel = self.compile( 2025-05-07T20:33:19.2433011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2433184Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2433312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2433317Z 2025-05-07T20:33:19.2433515Z self = 2025-05-07T20:33:19.2434274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2434769Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d892bb00>} 2025-05-07T20:33:19.2435550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2435779Z context = 2025-05-07T20:33:19.2435784Z 2025-05-07T20:33:19.2435942Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2436205Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2436309Z module_map=module_map) 2025-05-07T20:33:19.2436466Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2436571Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2436652Z E ^ 2025-05-07T20:33:19.2437041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2437046Z 2025-05-07T20:33:19.2437471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2437477Z 2025-05-07T20:33:19.2437579Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2437804Z self=, 2025-05-07T20:33:19.2437881Z T=128, 2025-05-07T20:33:19.2437957Z D=7168, 2025-05-07T20:33:19.2438042Z scale_ub=1200.0, 2025-05-07T20:33:19.2438125Z contiguous=False, 2025-05-07T20:33:19.2438208Z compiled=False, 2025-05-07T20:33:19.2438284Z ) 2025-05-07T20:33:19.2438496Z self = 2025-05-07T20:33:19.2438671Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2438678Z 2025-05-07T20:33:19.2438755Z @given( 2025-05-07T20:33:19.2438870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2438978Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2439095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2439209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2439321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2439393Z ) 2025-05-07T20:33:19.2439629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2439726Z def test_silu_mul_quant( 2025-05-07T20:33:19.2439801Z self, 2025-05-07T20:33:19.2439880Z T: int, 2025-05-07T20:33:19.2439958Z D: int, 2025-05-07T20:33:19.2440053Z scale_ub: Optional[float], 2025-05-07T20:33:19.2440357Z contiguous: bool, 2025-05-07T20:33:19.2440479Z compiled: bool, 2025-05-07T20:33:19.2440582Z ) -> None: 2025-05-07T20:33:19.2440685Z torch.manual_seed(2025) 2025-05-07T20:33:19.2440755Z 2025-05-07T20:33:19.2440923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2441090Z 2025-05-07T20:33:19.2441185Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2441311Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2441404Z x = x_sign * x_clamp 2025-05-07T20:33:19.2441483Z x0 = x[:, :D] 2025-05-07T20:33:19.2441567Z x1 = x[:, D:] 2025-05-07T20:33:19.2441640Z 2025-05-07T20:33:19.2441720Z if contiguous: 2025-05-07T20:33:19.2441819Z x0 = x0.contiguous() 2025-05-07T20:33:19.2441908Z x1 = x1.contiguous() 2025-05-07T20:33:19.2441976Z 2025-05-07T20:33:19.2442069Z if scale_ub is not None: 2025-05-07T20:33:19.2442173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2442304Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2442381Z ) 2025-05-07T20:33:19.2442459Z else: 2025-05-07T20:33:19.2442550Z scale_ub_tensor = None 2025-05-07T20:33:19.2442627Z 2025-05-07T20:33:19.2442757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2442964Z op = silu_mul_quant 2025-05-07T20:33:19.2443051Z if compiled: 2025-05-07T20:33:19.2443147Z op = torch.compile(op) 2025-05-07T20:33:19.2443253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2443323Z 2025-05-07T20:33:19.2443413Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2443417Z 2025-05-07T20:33:19.2443515Z moe/activation_test.py:117: 2025-05-07T20:33:19.2443639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2443738Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2443839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2444326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2444488Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2444845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2445067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2445408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2445503Z kernel = self.compile( 2025-05-07T20:33:19.2445905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2446084Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2446207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2446212Z 2025-05-07T20:33:19.2446420Z self = 2025-05-07T20:33:19.2447192Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2447685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8d66200>} 2025-05-07T20:33:19.2448421Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2448607Z context = 2025-05-07T20:33:19.2448611Z 2025-05-07T20:33:19.2448773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2449029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2449139Z module_map=module_map) 2025-05-07T20:33:19.2449350Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2449456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2449541Z E ^ 2025-05-07T20:33:19.2449890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2449894Z 2025-05-07T20:33:19.2450308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2450312Z 2025-05-07T20:33:19.2450419Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2450634Z self=, 2025-05-07T20:33:19.2450716Z T=128, 2025-05-07T20:33:19.2450792Z D=5120, 2025-05-07T20:33:19.2450873Z scale_ub=None, 2025-05-07T20:33:19.2450962Z contiguous=False, 2025-05-07T20:33:19.2451045Z compiled=False, 2025-05-07T20:33:19.2451116Z ) 2025-05-07T20:33:19.2451340Z self = 2025-05-07T20:33:19.2451587Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2451592Z 2025-05-07T20:33:19.2451670Z @given( 2025-05-07T20:33:19.2451792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2451889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2452008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2452121Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2452229Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2452304Z ) 2025-05-07T20:33:19.2452542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2452631Z def test_silu_mul_quant( 2025-05-07T20:33:19.2452750Z self, 2025-05-07T20:33:19.2452826Z T: int, 2025-05-07T20:33:19.2452902Z D: int, 2025-05-07T20:33:19.2453001Z scale_ub: Optional[float], 2025-05-07T20:33:19.2453090Z contiguous: bool, 2025-05-07T20:33:19.2453174Z compiled: bool, 2025-05-07T20:33:19.2453256Z ) -> None: 2025-05-07T20:33:19.2453345Z torch.manual_seed(2025) 2025-05-07T20:33:19.2453420Z 2025-05-07T20:33:19.2453582Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2453654Z 2025-05-07T20:33:19.2453748Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2453869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2453954Z x = x_sign * x_clamp 2025-05-07T20:33:19.2454034Z x0 = x[:, :D] 2025-05-07T20:33:19.2454111Z x1 = x[:, D:] 2025-05-07T20:33:19.2454183Z 2025-05-07T20:33:19.2454267Z if contiguous: 2025-05-07T20:33:19.2454359Z x0 = x0.contiguous() 2025-05-07T20:33:19.2454450Z x1 = x1.contiguous() 2025-05-07T20:33:19.2454525Z 2025-05-07T20:33:19.2454613Z if scale_ub is not None: 2025-05-07T20:33:19.2454718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2454860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2454932Z ) 2025-05-07T20:33:19.2455014Z else: 2025-05-07T20:33:19.2455105Z scale_ub_tensor = None 2025-05-07T20:33:19.2455178Z 2025-05-07T20:33:19.2455312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2455400Z op = silu_mul_quant 2025-05-07T20:33:19.2455482Z if compiled: 2025-05-07T20:33:19.2455591Z op = torch.compile(op) 2025-05-07T20:33:19.2455693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2455765Z 2025-05-07T20:33:19.2455860Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2455864Z 2025-05-07T20:33:19.2455958Z moe/activation_test.py:117: 2025-05-07T20:33:19.2456091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2456186Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2456330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2456830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2456922Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2457275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2457498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2457832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2457929Z kernel = self.compile( 2025-05-07T20:33:19.2458308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2458480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2458611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2458657Z 2025-05-07T20:33:19.2458918Z self = 2025-05-07T20:33:19.2459686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2460175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8935940>} 2025-05-07T20:33:19.2460905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2461136Z context = 2025-05-07T20:33:19.2461144Z 2025-05-07T20:33:19.2461309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2461574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2461678Z module_map=module_map) 2025-05-07T20:33:19.2461836Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2461936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2462012Z E ^ 2025-05-07T20:33:19.2462359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2462367Z 2025-05-07T20:33:19.2462775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2462783Z 2025-05-07T20:33:19.2462884Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2463110Z self=, 2025-05-07T20:33:19.2466787Z T=128, 2025-05-07T20:33:19.2466886Z D=5120, 2025-05-07T20:33:19.2466976Z scale_ub=1200.0, 2025-05-07T20:33:19.2467062Z contiguous=True, 2025-05-07T20:33:19.2467144Z compiled=False, 2025-05-07T20:33:19.2467220Z ) 2025-05-07T20:33:19.2467507Z self = 2025-05-07T20:33:19.2467689Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2467694Z 2025-05-07T20:33:19.2467775Z @given( 2025-05-07T20:33:19.2467895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2467994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2468107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2468221Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2468338Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2468413Z ) 2025-05-07T20:33:19.2468725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2468824Z def test_silu_mul_quant( 2025-05-07T20:33:19.2468902Z self, 2025-05-07T20:33:19.2468985Z T: int, 2025-05-07T20:33:19.2469062Z D: int, 2025-05-07T20:33:19.2469159Z scale_ub: Optional[float], 2025-05-07T20:33:19.2469251Z contiguous: bool, 2025-05-07T20:33:19.2469336Z compiled: bool, 2025-05-07T20:33:19.2469417Z ) -> None: 2025-05-07T20:33:19.2469513Z torch.manual_seed(2025) 2025-05-07T20:33:19.2469587Z 2025-05-07T20:33:19.2469753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2469835Z 2025-05-07T20:33:19.2469926Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2470050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2470140Z x = x_sign * x_clamp 2025-05-07T20:33:19.2470218Z x0 = x[:, :D] 2025-05-07T20:33:19.2470299Z x1 = x[:, D:] 2025-05-07T20:33:19.2470371Z 2025-05-07T20:33:19.2470497Z if contiguous: 2025-05-07T20:33:19.2470627Z x0 = x0.contiguous() 2025-05-07T20:33:19.2470713Z x1 = x1.contiguous() 2025-05-07T20:33:19.2470784Z 2025-05-07T20:33:19.2470871Z if scale_ub is not None: 2025-05-07T20:33:19.2470972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2471103Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2471181Z ) 2025-05-07T20:33:19.2471257Z else: 2025-05-07T20:33:19.2471348Z scale_ub_tensor = None 2025-05-07T20:33:19.2471422Z 2025-05-07T20:33:19.2471549Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2471646Z op = silu_mul_quant 2025-05-07T20:33:19.2471729Z if compiled: 2025-05-07T20:33:19.2471874Z op = torch.compile(op) 2025-05-07T20:33:19.2471981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2472054Z 2025-05-07T20:33:19.2472145Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2472153Z 2025-05-07T20:33:19.2472256Z moe/activation_test.py:117: 2025-05-07T20:33:19.2472381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2472478Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2472577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2473070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2473166Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2473519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2473733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2474076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2474169Z kernel = self.compile( 2025-05-07T20:33:19.2474555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2474727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2474848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2474852Z 2025-05-07T20:33:19.2475054Z self = 2025-05-07T20:33:19.2475815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2476362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872cc20>} 2025-05-07T20:33:19.2477147Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2477337Z context = 2025-05-07T20:33:19.2477342Z 2025-05-07T20:33:19.2477506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2477762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2477869Z module_map=module_map) 2025-05-07T20:33:19.2478026Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2478120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2478205Z E ^ 2025-05-07T20:33:19.2478552Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
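Every one of these failures is the same Triton error: the kernel requests the fp8e4nv (FP8 E4M3) element type, which Triton only compiles for NVIDIA GPUs of compute capability 8.9 and newer; on older architectures only 'fp8e4b15' and 'fp8e5' are available, exactly as the ValueError reports. A minimal sketch of a capability guard that would let such tests skip rather than error on older GPUs; supports_fp8e4nv and skip_if_no_fp8 are illustrative names, not an existing FBGEMM API:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Best-effort check: Triton lowers fp8e4nv only on SM 8.9+ NVIDIA GPUs."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Applied to a test class or method, this turns the CompilationError into a skip.
skip_if_no_fp8 = unittest.skipUnless(
    supports_fp8e4nv(), "fp8e4nv (FP8 E4M3) requires compute capability >= 8.9"
)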
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Every remaining example fails with the same fp8e4nv CompilationError, in the same test body and traceback as above; only the sampled parameters (and, in one case, the failing call) differ.

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
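The "Trying example" lines are Hypothesis verbose-mode output: each argument is drawn independently by st.sampled_from, which is why the log cycles through assorted (T, D, scale_ub, contiguous, compiled) tuples, repeats included. A self-contained sketch of that sampling pattern, with a stand-in assertion instead of the real test body (assumes hypothesis is installed):

from hypothesis import Verbosity, given, settings, strategies as st


@settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
def check_sampling(T, D, scale_ub) -> None:
    # Each argument is an independent draw from its list, so duplicate and
    # near-duplicate examples across a run are expected.
    assert T in (1, 128, 2048, 4096, 16384) and D in (5120, 7168)


check_sampling()  # prints "Trying example: check_sampling(...)" lines like the ones above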
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
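This example is the one variant in the section: the failing line moved from fn() to the reference path, where triton_quantize_fp8_row compiles its own FP8 kernel (_kernel_quantize_fp8_row) and trips the identical fp8e4nv error. For comparison, a pure-PyTorch sketch of what the reference path computes, assuming triton_quantize_fp8_row performs per-row max-abs scaling into torch.float8_e4m3fn; the function names and exact scaling details are illustrative, not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs scaling; scale_ub, when given, caps the row maximum.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale per row
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 in fp32, matching the test's ref_fn.
    x0_fp32 = x0.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)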
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2611953Z 2025-05-07T20:33:19.2612389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2612393Z 2025-05-07T20:33:19.2612489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2612703Z self=, 2025-05-07T20:33:19.2612779Z T=4096, 2025-05-07T20:33:19.2612849Z D=5120, 2025-05-07T20:33:19.2612925Z scale_ub=None, 2025-05-07T20:33:19.2613011Z contiguous=False, 2025-05-07T20:33:19.2613086Z compiled=True, 2025-05-07T20:33:19.2613156Z ) 2025-05-07T20:33:19.2613365Z self = 2025-05-07T20:33:19.2613529Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2613537Z 2025-05-07T20:33:19.2613608Z @given( 2025-05-07T20:33:19.2613767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2613863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2613983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2614095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2614202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2614274Z ) 2025-05-07T20:33:19.2614508Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2614598Z def test_silu_mul_quant( 2025-05-07T20:33:19.2614669Z self, 2025-05-07T20:33:19.2614741Z T: int, 2025-05-07T20:33:19.2614815Z D: int, 2025-05-07T20:33:19.2614906Z scale_ub: Optional[float], 2025-05-07T20:33:19.2614991Z contiguous: bool, 2025-05-07T20:33:19.2615073Z compiled: bool, 2025-05-07T20:33:19.2615147Z ) -> None: 2025-05-07T20:33:19.2615237Z torch.manual_seed(2025) 2025-05-07T20:33:19.2615310Z 2025-05-07T20:33:19.2615472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2615610Z 2025-05-07T20:33:19.2615736Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2615866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2615968Z x = x_sign * x_clamp 2025-05-07T20:33:19.2616054Z x0 = x[:, :D] 2025-05-07T20:33:19.2616137Z x1 = x[:, D:] 2025-05-07T20:33:19.2616208Z 2025-05-07T20:33:19.2616288Z if contiguous: 2025-05-07T20:33:19.2616373Z x0 = x0.contiguous() 2025-05-07T20:33:19.2616460Z x1 = x1.contiguous() 2025-05-07T20:33:19.2616524Z 2025-05-07T20:33:19.2616608Z if scale_ub is not None: 2025-05-07T20:33:19.2616711Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2616838Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2616955Z ) 2025-05-07T20:33:19.2617029Z else: 2025-05-07T20:33:19.2617116Z scale_ub_tensor = None 2025-05-07T20:33:19.2617183Z 2025-05-07T20:33:19.2617312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2617396Z op = silu_mul_quant 2025-05-07T20:33:19.2617478Z if compiled: 2025-05-07T20:33:19.2617570Z op = torch.compile(op) 2025-05-07T20:33:19.2617669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2617738Z 2025-05-07T20:33:19.2617824Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2617829Z 2025-05-07T20:33:19.2617920Z moe/activation_test.py:117: 2025-05-07T20:33:19.2618045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2618138Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2618231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2618596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2618686Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2619176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2619268Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2619618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2619836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2620170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2620262Z kernel = self.compile( 2025-05-07T20:33:19.2620638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2620806Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2620931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2620936Z 2025-05-07T20:33:19.2621177Z self = 2025-05-07T20:33:19.2621943Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2622433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a74c20>} 2025-05-07T20:33:19.2623165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2623354Z context = 2025-05-07T20:33:19.2623359Z 2025-05-07T20:33:19.2623518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2623815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2623954Z module_map=module_map) 2025-05-07T20:33:19.2624108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2624205Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2624279Z E ^ 2025-05-07T20:33:19.2624623Z E ValueError("type fp8e4nv not supported in this architecture. 
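The failure is an architecture mismatch rather than a data-dependent bug: this job runs on linux.g5.4xlarge (NVIDIA A10G, compute capability 8.6), while Triton only lowers the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) on compute capability 8.9 and newer; on SM 8.6 only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. Every example Hypothesis goes on to try (listed below) therefore dies in the same compile step. A minimal sketch of a capability guard that would skip these cases on unsupported GPUs (the helper name _supports_fp8e4nv and the decorator placement are illustrative assumptions, not code from moe/activation_test.py):

    # Illustrative sketch (assumed helper name; not part of the FBGEMM tree):
    # skip Triton fp8e4nv tests on GPUs that cannot lower the dtype.
    import torch

    def _supports_fp8e4nv() -> bool:
        """True only if the current CUDA device can compile fp8e4nv kernels."""
        if not torch.cuda.is_available():
            return False
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9 (Ada) or SM 9.0 (Hopper);
        # the A10G on this linux.g5.4xlarge runner reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test (requires `import unittest`):
    #
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None:
    #     ...

With such a guard the job would record a skip on A10G runners instead of replaying the identical CompilationError for every drawn example.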
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2624632Z 2025-05-07T20:33:19.2625039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2625043Z 2025-05-07T20:33:19.2625139Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2625394Z self=, 2025-05-07T20:33:19.2625465Z T=4096, 2025-05-07T20:33:19.2625539Z D=5120, 2025-05-07T20:33:19.2625619Z scale_ub=1200.0, 2025-05-07T20:33:19.2625707Z contiguous=False, 2025-05-07T20:33:19.2625787Z compiled=False, 2025-05-07T20:33:19.2625856Z ) 2025-05-07T20:33:19.2626088Z self = 2025-05-07T20:33:19.2626289Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2626294Z 2025-05-07T20:33:19.2626369Z @given( 2025-05-07T20:33:19.2626479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2626573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2626683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2626794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2626903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2626973Z ) 2025-05-07T20:33:19.2627209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2627301Z def test_silu_mul_quant( 2025-05-07T20:33:19.2627377Z self, 2025-05-07T20:33:19.2627503Z T: int, 2025-05-07T20:33:19.2627575Z D: int, 2025-05-07T20:33:19.2627667Z scale_ub: Optional[float], 2025-05-07T20:33:19.2627754Z contiguous: bool, 2025-05-07T20:33:19.2627833Z compiled: bool, 2025-05-07T20:33:19.2627905Z ) -> None: 2025-05-07T20:33:19.2627995Z torch.manual_seed(2025) 2025-05-07T20:33:19.2628064Z 2025-05-07T20:33:19.2628224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2628294Z 2025-05-07T20:33:19.2628379Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2628497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2628583Z x = x_sign * x_clamp 2025-05-07T20:33:19.2628659Z x0 = x[:, :D] 2025-05-07T20:33:19.2628734Z x1 = x[:, D:] 2025-05-07T20:33:19.2628800Z 2025-05-07T20:33:19.2628877Z if contiguous: 2025-05-07T20:33:19.2629010Z x0 = x0.contiguous() 2025-05-07T20:33:19.2629098Z x1 = x1.contiguous() 2025-05-07T20:33:19.2629165Z 2025-05-07T20:33:19.2629250Z if scale_ub is not None: 2025-05-07T20:33:19.2629349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2629478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2629548Z ) 2025-05-07T20:33:19.2629618Z else: 2025-05-07T20:33:19.2629706Z scale_ub_tensor = None 2025-05-07T20:33:19.2629776Z 2025-05-07T20:33:19.2629899Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2629982Z op = silu_mul_quant 2025-05-07T20:33:19.2630064Z if compiled: 2025-05-07T20:33:19.2630156Z op = torch.compile(op) 2025-05-07T20:33:19.2630262Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2630330Z 2025-05-07T20:33:19.2630415Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2630424Z 2025-05-07T20:33:19.2630516Z moe/activation_test.py:117: 2025-05-07T20:33:19.2630718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2630812Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2630906Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2631391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2631483Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2631835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2632050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2632388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2632517Z kernel = self.compile( 2025-05-07T20:33:19.2632899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2633072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2633192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2633197Z 2025-05-07T20:33:19.2633393Z self = 2025-05-07T20:33:19.2634151Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2634637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a756c0>} 2025-05-07T20:33:19.2635375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2635559Z context = 2025-05-07T20:33:19.2635564Z 2025-05-07T20:33:19.2635726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2635978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2636083Z module_map=module_map) 2025-05-07T20:33:19.2636235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2636327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2636398Z E ^ 2025-05-07T20:33:19.2636743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2636751Z 2025-05-07T20:33:19.2637201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2637211Z 2025-05-07T20:33:19.2637311Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2637525Z self=, 2025-05-07T20:33:19.2637597Z T=4096, 2025-05-07T20:33:19.2637667Z D=5120, 2025-05-07T20:33:19.2637747Z scale_ub=1200.0, 2025-05-07T20:33:19.2637831Z contiguous=False, 2025-05-07T20:33:19.2637907Z compiled=True, 2025-05-07T20:33:19.2637972Z ) 2025-05-07T20:33:19.2638185Z self = 2025-05-07T20:33:19.2638353Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.2638357Z 2025-05-07T20:33:19.2638429Z @given( 2025-05-07T20:33:19.2638545Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2638637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2638749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2638938Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2639047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2639117Z ) 2025-05-07T20:33:19.2639364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2639450Z def test_silu_mul_quant( 2025-05-07T20:33:19.2639523Z self, 2025-05-07T20:33:19.2639596Z T: int, 2025-05-07T20:33:19.2639667Z D: int, 2025-05-07T20:33:19.2639761Z scale_ub: Optional[float], 2025-05-07T20:33:19.2639842Z contiguous: bool, 2025-05-07T20:33:19.2639921Z compiled: bool, 2025-05-07T20:33:19.2639997Z ) -> None: 2025-05-07T20:33:19.2640296Z torch.manual_seed(2025) 2025-05-07T20:33:19.2640519Z 2025-05-07T20:33:19.2640698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2640765Z 2025-05-07T20:33:19.2640859Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2640980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2641066Z x = x_sign * x_clamp 2025-05-07T20:33:19.2641142Z x0 = x[:, :D] 2025-05-07T20:33:19.2641215Z x1 = x[:, D:] 2025-05-07T20:33:19.2641284Z 2025-05-07T20:33:19.2641363Z if contiguous: 2025-05-07T20:33:19.2641449Z x0 = x0.contiguous() 2025-05-07T20:33:19.2641530Z x1 = x1.contiguous() 2025-05-07T20:33:19.2641601Z 2025-05-07T20:33:19.2641685Z if scale_ub is not None: 2025-05-07T20:33:19.2641789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2641916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2641991Z ) 2025-05-07T20:33:19.2642068Z else: 2025-05-07T20:33:19.2642162Z scale_ub_tensor = None 2025-05-07T20:33:19.2642231Z 2025-05-07T20:33:19.2642358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2642445Z op = silu_mul_quant 2025-05-07T20:33:19.2642528Z if compiled: 2025-05-07T20:33:19.2642631Z op = torch.compile(op) 2025-05-07T20:33:19.2642736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2642807Z 2025-05-07T20:33:19.2642897Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2642901Z 2025-05-07T20:33:19.2642991Z moe/activation_test.py:117: 2025-05-07T20:33:19.2643117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2643210Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2643303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2643662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2643750Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2644335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2644436Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2644793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2645015Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2645347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2645438Z kernel = self.compile( 2025-05-07T20:33:19.2645839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2646006Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2646128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2646138Z 2025-05-07T20:33:19.2646334Z self = 2025-05-07T20:33:19.2647157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2647701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a76fc0>} 2025-05-07T20:33:19.2648431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2648620Z context = 2025-05-07T20:33:19.2648625Z 2025-05-07T20:33:19.2648832Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2649091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2649204Z module_map=module_map) 2025-05-07T20:33:19.2649360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2649459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2649532Z E ^ 2025-05-07T20:33:19.2649876Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2649881Z 2025-05-07T20:33:19.2650314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2650319Z 2025-05-07T20:33:19.2650417Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2650633Z self=, 2025-05-07T20:33:19.2650717Z T=2048, 2025-05-07T20:33:19.2650791Z D=7168, 2025-05-07T20:33:19.2650876Z scale_ub=1200.0, 2025-05-07T20:33:19.2650960Z contiguous=False, 2025-05-07T20:33:19.2651047Z compiled=False, 2025-05-07T20:33:19.2651125Z ) 2025-05-07T20:33:19.2651341Z self = 2025-05-07T20:33:19.2651512Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2651517Z 2025-05-07T20:33:19.2651596Z @given( 2025-05-07T20:33:19.2651710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2651806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2651923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2652035Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2652148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2652221Z ) 2025-05-07T20:33:19.2652460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2652558Z def test_silu_mul_quant( 2025-05-07T20:33:19.2652632Z self, 2025-05-07T20:33:19.2652752Z T: int, 2025-05-07T20:33:19.2652827Z D: int, 2025-05-07T20:33:19.2652925Z scale_ub: Optional[float], 2025-05-07T20:33:19.2653008Z contiguous: bool, 2025-05-07T20:33:19.2653090Z compiled: bool, 2025-05-07T20:33:19.2653167Z ) -> None: 2025-05-07T20:33:19.2653257Z torch.manual_seed(2025) 2025-05-07T20:33:19.2653329Z 2025-05-07T20:33:19.2653491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2653565Z 2025-05-07T20:33:19.2653652Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2653771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2653860Z x = x_sign * x_clamp 2025-05-07T20:33:19.2653936Z x0 = x[:, :D] 2025-05-07T20:33:19.2654010Z x1 = x[:, D:] 2025-05-07T20:33:19.2654087Z 2025-05-07T20:33:19.2654166Z if contiguous: 2025-05-07T20:33:19.2654252Z x0 = x0.contiguous() 2025-05-07T20:33:19.2654340Z x1 = x1.contiguous() 2025-05-07T20:33:19.2654413Z 2025-05-07T20:33:19.2654649Z if scale_ub is not None: 2025-05-07T20:33:19.2654756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2654884Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2654962Z ) 2025-05-07T20:33:19.2655034Z else: 2025-05-07T20:33:19.2655124Z scale_ub_tensor = None 2025-05-07T20:33:19.2655196Z 2025-05-07T20:33:19.2655319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2655403Z op = silu_mul_quant 2025-05-07T20:33:19.2655491Z if compiled: 2025-05-07T20:33:19.2655587Z op = torch.compile(op) 2025-05-07T20:33:19.2655689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2655765Z 2025-05-07T20:33:19.2655892Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2655897Z 2025-05-07T20:33:19.2655990Z moe/activation_test.py:117: 2025-05-07T20:33:19.2656120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2656220Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2656317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2656801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2656893Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2657252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2657468Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2657809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2657903Z kernel = self.compile( 2025-05-07T20:33:19.2658299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2658476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2658604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2658609Z 2025-05-07T20:33:19.2658807Z self = 2025-05-07T20:33:19.2659573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2660061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a77ec0>} 2025-05-07T20:33:19.2660846Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2661038Z context = 2025-05-07T20:33:19.2661043Z 2025-05-07T20:33:19.2661205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2661461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2661565Z module_map=module_map) 2025-05-07T20:33:19.2661731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2661823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2661895Z E ^ 2025-05-07T20:33:19.2662246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2662250Z 2025-05-07T20:33:19.2662662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2662667Z 2025-05-07T20:33:19.2662773Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2663068Z self=, 2025-05-07T20:33:19.2663144Z T=1, 2025-05-07T20:33:19.2663223Z D=7168, 2025-05-07T20:33:19.2663300Z scale_ub=None, 2025-05-07T20:33:19.2663380Z contiguous=True, 2025-05-07T20:33:19.2663465Z compiled=False, 2025-05-07T20:33:19.2663535Z ) 2025-05-07T20:33:19.2663746Z self = 2025-05-07T20:33:19.2663910Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2663915Z 2025-05-07T20:33:19.2663991Z @given( 2025-05-07T20:33:19.2664107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2664203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2664356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2664474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2664585Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2664662Z ) 2025-05-07T20:33:19.2664901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2664987Z def test_silu_mul_quant( 2025-05-07T20:33:19.2665062Z self, 2025-05-07T20:33:19.2665134Z T: int, 2025-05-07T20:33:19.2665204Z D: int, 2025-05-07T20:33:19.2665303Z scale_ub: Optional[float], 2025-05-07T20:33:19.2665386Z contiguous: bool, 2025-05-07T20:33:19.2665464Z compiled: bool, 2025-05-07T20:33:19.2665539Z ) -> None: 2025-05-07T20:33:19.2665627Z torch.manual_seed(2025) 2025-05-07T20:33:19.2665695Z 2025-05-07T20:33:19.2665865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2665939Z 2025-05-07T20:33:19.2666038Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2666174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2666284Z x = x_sign * x_clamp 2025-05-07T20:33:19.2666372Z x0 = x[:, :D] 2025-05-07T20:33:19.2666454Z x1 = x[:, D:] 2025-05-07T20:33:19.2666525Z 2025-05-07T20:33:19.2666610Z if contiguous: 2025-05-07T20:33:19.2666697Z x0 = x0.contiguous() 2025-05-07T20:33:19.2666783Z x1 = x1.contiguous() 2025-05-07T20:33:19.2666856Z 2025-05-07T20:33:19.2666941Z if scale_ub is not None: 2025-05-07T20:33:19.2667041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2667171Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2667243Z ) 2025-05-07T20:33:19.2667316Z else: 2025-05-07T20:33:19.2667461Z scale_ub_tensor = None 2025-05-07T20:33:19.2667531Z 2025-05-07T20:33:19.2667654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2667743Z op = silu_mul_quant 2025-05-07T20:33:19.2667823Z if compiled: 2025-05-07T20:33:19.2667964Z op = torch.compile(op) 2025-05-07T20:33:19.2668070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2668136Z 2025-05-07T20:33:19.2668224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2668229Z 2025-05-07T20:33:19.2668320Z moe/activation_test.py:117: 2025-05-07T20:33:19.2668441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2668536Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2668630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2669120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2669212Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2669564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2669784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2670164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2670289Z kernel = self.compile( 2025-05-07T20:33:19.2670692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2670858Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2670982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2670987Z 2025-05-07T20:33:19.2671184Z self = 2025-05-07T20:33:19.2671941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2672501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d4cc0>} 2025-05-07T20:33:19.2673235Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2673425Z context = 2025-05-07T20:33:19.2673430Z 2025-05-07T20:33:19.2673585Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2673837Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2673944Z module_map=module_map) 2025-05-07T20:33:19.2674100Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2674197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2674272Z E ^ 2025-05-07T20:33:19.2674622Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2674629Z 2025-05-07T20:33:19.2675043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2675048Z 2025-05-07T20:33:19.2675145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2675362Z self=, 2025-05-07T20:33:19.2675436Z T=16384, 2025-05-07T20:33:19.2675509Z D=7168, 2025-05-07T20:33:19.2675586Z scale_ub=1200.0, 2025-05-07T20:33:19.2675667Z contiguous=False, 2025-05-07T20:33:19.2675747Z compiled=True, 2025-05-07T20:33:19.2675816Z ) 2025-05-07T20:33:19.2676052Z self = 2025-05-07T20:33:19.2676253Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.2676257Z 2025-05-07T20:33:19.2676381Z @given( 2025-05-07T20:33:19.2676498Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2676600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2676710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2676820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2676928Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2677000Z ) 2025-05-07T20:33:19.2677236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2677327Z def test_silu_mul_quant( 2025-05-07T20:33:19.2677397Z self, 2025-05-07T20:33:19.2677471Z T: int, 2025-05-07T20:33:19.2677547Z D: int, 2025-05-07T20:33:19.2677638Z scale_ub: Optional[float], 2025-05-07T20:33:19.2677728Z contiguous: bool, 2025-05-07T20:33:19.2677813Z compiled: bool, 2025-05-07T20:33:19.2677885Z ) -> None: 2025-05-07T20:33:19.2677979Z torch.manual_seed(2025) 2025-05-07T20:33:19.2678047Z 2025-05-07T20:33:19.2678289Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2678368Z 2025-05-07T20:33:19.2678454Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2678572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2678657Z x = x_sign * x_clamp 2025-05-07T20:33:19.2678732Z x0 = x[:, :D] 2025-05-07T20:33:19.2678805Z x1 = x[:, D:] 2025-05-07T20:33:19.2678874Z 2025-05-07T20:33:19.2678953Z if contiguous: 2025-05-07T20:33:19.2679038Z x0 = x0.contiguous() 2025-05-07T20:33:19.2679127Z x1 = x1.contiguous() 2025-05-07T20:33:19.2679195Z 2025-05-07T20:33:19.2679281Z if scale_ub is not None: 2025-05-07T20:33:19.2679379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2679551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2679627Z ) 2025-05-07T20:33:19.2679703Z else: 2025-05-07T20:33:19.2679792Z scale_ub_tensor = None 2025-05-07T20:33:19.2679870Z 2025-05-07T20:33:19.2679993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2680077Z op = silu_mul_quant 2025-05-07T20:33:19.2680161Z if compiled: 2025-05-07T20:33:19.2680254Z op = torch.compile(op) 2025-05-07T20:33:19.2680352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2680423Z 2025-05-07T20:33:19.2680507Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2680512Z 2025-05-07T20:33:19.2680606Z moe/activation_test.py:117: 2025-05-07T20:33:19.2680728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2680823Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2680920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2681281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2681371Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2681864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2681954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2682307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2682522Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2682855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2682945Z kernel = self.compile( 2025-05-07T20:33:19.2683338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2683506Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2683675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2683683Z 2025-05-07T20:33:19.2683881Z self = 2025-05-07T20:33:19.2684642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2685127Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d60c0>} 2025-05-07T20:33:19.2685858Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2686041Z context = 2025-05-07T20:33:19.2686048Z 2025-05-07T20:33:19.2686245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2686538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2686639Z module_map=module_map) 2025-05-07T20:33:19.2686800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2686892Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2686964Z E ^ 2025-05-07T20:33:19.2687315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2687319Z 2025-05-07T20:33:19.2687748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2687792Z 2025-05-07T20:33:19.2687890Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2688111Z self=, 2025-05-07T20:33:19.2688185Z T=1, 2025-05-07T20:33:19.2688264Z D=7168, 2025-05-07T20:33:19.2688341Z scale_ub=None, 2025-05-07T20:33:19.2688420Z contiguous=False, 2025-05-07T20:33:19.2688504Z compiled=False, 2025-05-07T20:33:19.2688570Z ) 2025-05-07T20:33:19.2688782Z self = 2025-05-07T20:33:19.2688947Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2688952Z 2025-05-07T20:33:19.2689026Z @given( 2025-05-07T20:33:19.2689139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2689237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2689346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2689458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2689568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2689636Z ) 2025-05-07T20:33:19.2689877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2689968Z def test_silu_mul_quant( 2025-05-07T20:33:19.2690040Z self, 2025-05-07T20:33:19.2690116Z T: int, 2025-05-07T20:33:19.2690188Z D: int, 2025-05-07T20:33:19.2690279Z scale_ub: Optional[float], 2025-05-07T20:33:19.2690365Z contiguous: bool, 2025-05-07T20:33:19.2690444Z compiled: bool, 2025-05-07T20:33:19.2690516Z ) -> None: 2025-05-07T20:33:19.2690607Z torch.manual_seed(2025) 2025-05-07T20:33:19.2690676Z 2025-05-07T20:33:19.2690840Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2690909Z 2025-05-07T20:33:19.2690994Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2691114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2691198Z x = x_sign * x_clamp 2025-05-07T20:33:19.2691271Z x0 = x[:, :D] 2025-05-07T20:33:19.2691348Z x1 = x[:, D:] 2025-05-07T20:33:19.2691459Z 2025-05-07T20:33:19.2691539Z if contiguous: 2025-05-07T20:33:19.2691628Z x0 = x0.contiguous() 2025-05-07T20:33:19.2691712Z x1 = x1.contiguous() 2025-05-07T20:33:19.2691781Z 2025-05-07T20:33:19.2691868Z if scale_ub is not None: 2025-05-07T20:33:19.2691967Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2692099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2692170Z ) 2025-05-07T20:33:19.2692240Z else: 2025-05-07T20:33:19.2692332Z scale_ub_tensor = None 2025-05-07T20:33:19.2692399Z 2025-05-07T20:33:19.2692525Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2692611Z op = silu_mul_quant 2025-05-07T20:33:19.2692692Z if compiled: 2025-05-07T20:33:19.2692787Z op = torch.compile(op) 2025-05-07T20:33:19.2692888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2692956Z 2025-05-07T20:33:19.2693044Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2693091Z 2025-05-07T20:33:19.2693223Z moe/activation_test.py:117: 2025-05-07T20:33:19.2693348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2693448Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2693543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2694035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2694134Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2694489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2694707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2695086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2695179Z kernel = self.compile( 2025-05-07T20:33:19.2695568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2695735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2695856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2695860Z 2025-05-07T20:33:19.2696057Z self = 2025-05-07T20:33:19.2696816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2697311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d6c00>} 2025-05-07T20:33:19.2698043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2698227Z context = 2025-05-07T20:33:19.2698234Z 2025-05-07T20:33:19.2698390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2698643Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2698746Z module_map=module_map) 2025-05-07T20:33:19.2698903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2698995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2699079Z E ^ 2025-05-07T20:33:19.2699426Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2699471Z 2025-05-07T20:33:19.2699911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2699918Z 2025-05-07T20:33:19.2700015Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2700231Z self=, 2025-05-07T20:33:19.2700307Z T=2048, 2025-05-07T20:33:19.2700379Z D=7168, 2025-05-07T20:33:19.2700454Z scale_ub=None, 2025-05-07T20:33:19.2700543Z contiguous=False, 2025-05-07T20:33:19.2700622Z compiled=True, 2025-05-07T20:33:19.2700693Z ) 2025-05-07T20:33:19.2700907Z self = 2025-05-07T20:33:19.2701073Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2701083Z 2025-05-07T20:33:19.2701159Z @given( 2025-05-07T20:33:19.2704583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2704699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2704979Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2705092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2705200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2705273Z ) 2025-05-07T20:33:19.2705511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2705604Z def test_silu_mul_quant( 2025-05-07T20:33:19.2705679Z self, 2025-05-07T20:33:19.2705754Z T: int, 2025-05-07T20:33:19.2705828Z D: int, 2025-05-07T20:33:19.2705921Z scale_ub: Optional[float], 2025-05-07T20:33:19.2706004Z contiguous: bool, 2025-05-07T20:33:19.2706087Z compiled: bool, 2025-05-07T20:33:19.2706162Z ) -> None: 2025-05-07T20:33:19.2706293Z torch.manual_seed(2025) 2025-05-07T20:33:19.2706365Z 2025-05-07T20:33:19.2706529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2706603Z 2025-05-07T20:33:19.2706694Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2706817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2706899Z x = x_sign * x_clamp 2025-05-07T20:33:19.2706978Z x0 = x[:, :D] 2025-05-07T20:33:19.2707052Z x1 = x[:, D:] 2025-05-07T20:33:19.2707126Z 2025-05-07T20:33:19.2707205Z if contiguous: 2025-05-07T20:33:19.2707289Z x0 = x0.contiguous() 2025-05-07T20:33:19.2707376Z x1 = x1.contiguous() 2025-05-07T20:33:19.2707504Z 2025-05-07T20:33:19.2707590Z if scale_ub is not None: 2025-05-07T20:33:19.2707695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2707824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2707900Z ) 2025-05-07T20:33:19.2707975Z else: 2025-05-07T20:33:19.2708063Z scale_ub_tensor = None 2025-05-07T20:33:19.2708128Z 2025-05-07T20:33:19.2708257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2708346Z op = silu_mul_quant 2025-05-07T20:33:19.2708430Z if compiled: 2025-05-07T20:33:19.2708525Z op = torch.compile(op) 2025-05-07T20:33:19.2708625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2708693Z 2025-05-07T20:33:19.2708780Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2708785Z 2025-05-07T20:33:19.2708877Z moe/activation_test.py:117: 2025-05-07T20:33:19.2709003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2709100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2709194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2709567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2709661Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2710202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2710302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2710655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2710875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2711209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2711298Z kernel = self.compile( 2025-05-07T20:33:19.2711698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2711867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2711994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2711999Z 2025-05-07T20:33:19.2712197Z self = 2025-05-07T20:33:19.2713041Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2713532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d93802c0>} 2025-05-07T20:33:19.2714260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2714447Z context = 2025-05-07T20:33:19.2714491Z 2025-05-07T20:33:19.2714648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2714911Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2715016Z module_map=module_map) 2025-05-07T20:33:19.2715170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2715269Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2715344Z E ^ 2025-05-07T20:33:19.2715691Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2715696Z 2025-05-07T20:33:19.2716144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2716149Z 2025-05-07T20:33:19.2716266Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2716492Z self=, 2025-05-07T20:33:19.2716567Z T=4096, 2025-05-07T20:33:19.2716641Z D=7168, 2025-05-07T20:33:19.2716725Z scale_ub=None, 2025-05-07T20:33:19.2716810Z contiguous=False, 2025-05-07T20:33:19.2716893Z compiled=True, 2025-05-07T20:33:19.2716965Z ) 2025-05-07T20:33:19.2717174Z self = 2025-05-07T20:33:19.2717342Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2717350Z 2025-05-07T20:33:19.2717421Z @given( 2025-05-07T20:33:19.2717533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2717628Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2717734Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2717844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2717955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2718028Z ) 2025-05-07T20:33:19.2718264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2718353Z def test_silu_mul_quant( 2025-05-07T20:33:19.2718469Z self, 2025-05-07T20:33:19.2718549Z T: int, 2025-05-07T20:33:19.2718621Z D: int, 2025-05-07T20:33:19.2718713Z scale_ub: Optional[float], 2025-05-07T20:33:19.2718798Z contiguous: bool, 2025-05-07T20:33:19.2718879Z compiled: bool, 2025-05-07T20:33:19.2718950Z ) -> None: 2025-05-07T20:33:19.2719040Z torch.manual_seed(2025) 2025-05-07T20:33:19.2719108Z 2025-05-07T20:33:19.2719268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2719338Z 2025-05-07T20:33:19.2719427Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2719547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2719639Z x = x_sign * x_clamp 2025-05-07T20:33:19.2719714Z x0 = x[:, :D] 2025-05-07T20:33:19.2719794Z x1 = x[:, D:] 2025-05-07T20:33:19.2719868Z 2025-05-07T20:33:19.2719948Z if contiguous: 2025-05-07T20:33:19.2720042Z x0 = x0.contiguous() 2025-05-07T20:33:19.2720130Z x1 = x1.contiguous() 2025-05-07T20:33:19.2720275Z 2025-05-07T20:33:19.2720365Z if scale_ub is not None: 2025-05-07T20:33:19.2720464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2720594Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2720671Z ) 2025-05-07T20:33:19.2720743Z else: 2025-05-07T20:33:19.2720830Z scale_ub_tensor = None 2025-05-07T20:33:19.2720902Z 2025-05-07T20:33:19.2721025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2721109Z op = silu_mul_quant 2025-05-07T20:33:19.2721194Z if compiled: 2025-05-07T20:33:19.2721289Z op = torch.compile(op) 2025-05-07T20:33:19.2721397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2721504Z 2025-05-07T20:33:19.2721589Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2721594Z 2025-05-07T20:33:19.2721692Z moe/activation_test.py:117: 2025-05-07T20:33:19.2721818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2721916Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2722011Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2722374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2722464Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2722951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2723043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2723397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2723616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2723952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2724048Z kernel = self.compile( 2025-05-07T20:33:19.2724424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2724594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2724714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2724718Z 2025-05-07T20:33:19.2724914Z self = 2025-05-07T20:33:19.2725678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2726210Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9380d60>} 2025-05-07T20:33:19.2726949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2727131Z context = 2025-05-07T20:33:19.2727136Z 2025-05-07T20:33:19.2727291Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2727556Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2727657Z module_map=module_map) 2025-05-07T20:33:19.2727817Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2727913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2727984Z E ^ 2025-05-07T20:33:19.2728335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2728382Z 2025-05-07T20:33:19.2728828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2728833Z 2025-05-07T20:33:19.2728933Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2729145Z self=, 2025-05-07T20:33:19.2729219Z T=16384, 2025-05-07T20:33:19.2729295Z D=5120, 2025-05-07T20:33:19.2729374Z scale_ub=1200.0, 2025-05-07T20:33:19.2729458Z contiguous=False, 2025-05-07T20:33:19.2729536Z compiled=False, 2025-05-07T20:33:19.2729604Z ) 2025-05-07T20:33:19.2729815Z self = 2025-05-07T20:33:19.2729990Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2730035Z 2025-05-07T20:33:19.2730108Z @given( 2025-05-07T20:33:19.2730225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2730325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2730434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2730546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2730652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2730722Z ) 2025-05-07T20:33:19.2730959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2731046Z def test_silu_mul_quant( 2025-05-07T20:33:19.2731118Z self, 2025-05-07T20:33:19.2731193Z T: int, 2025-05-07T20:33:19.2731263Z D: int, 2025-05-07T20:33:19.2731354Z scale_ub: Optional[float], 2025-05-07T20:33:19.2731441Z contiguous: bool, 2025-05-07T20:33:19.2731524Z compiled: bool, 2025-05-07T20:33:19.2731597Z ) -> None: 2025-05-07T20:33:19.2731683Z torch.manual_seed(2025) 2025-05-07T20:33:19.2731751Z 2025-05-07T20:33:19.2731918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2731994Z 2025-05-07T20:33:19.2732082Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2732203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2732288Z x = x_sign * x_clamp 2025-05-07T20:33:19.2732360Z x0 = x[:, :D] 2025-05-07T20:33:19.2732439Z x1 = x[:, D:] 2025-05-07T20:33:19.2732506Z 2025-05-07T20:33:19.2732585Z if contiguous: 2025-05-07T20:33:19.2732676Z x0 = x0.contiguous() 2025-05-07T20:33:19.2732760Z x1 = x1.contiguous() 2025-05-07T20:33:19.2732827Z 2025-05-07T20:33:19.2732911Z if scale_ub is not None: 2025-05-07T20:33:19.2733011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2733143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2733216Z ) 2025-05-07T20:33:19.2733290Z else: 2025-05-07T20:33:19.2733428Z scale_ub_tensor = None 2025-05-07T20:33:19.2733500Z 2025-05-07T20:33:19.2733631Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2733720Z op = silu_mul_quant 2025-05-07T20:33:19.2733801Z if compiled: 2025-05-07T20:33:19.2733894Z op = torch.compile(op) 2025-05-07T20:33:19.2733996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2734060Z 2025-05-07T20:33:19.2734150Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2734155Z 2025-05-07T20:33:19.2734250Z moe/activation_test.py:117: 2025-05-07T20:33:19.2734373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2734470Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2734565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2735052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2735149Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2735565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2735820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2736153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2736241Z kernel = self.compile( 2025-05-07T20:33:19.2736623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2736790Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2736909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2736958Z 2025-05-07T20:33:19.2737154Z self = 2025-05-07T20:33:19.2737921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2738414Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9381c60>} 2025-05-07T20:33:19.2739145Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2739333Z context = 2025-05-07T20:33:19.2739337Z 2025-05-07T20:33:19.2739494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2739750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2739858Z module_map=module_map) 2025-05-07T20:33:19.2740017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2740350Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2740470Z E ^ 2025-05-07T20:33:19.2740856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2740861Z 2025-05-07T20:33:19.2741278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2741284Z 2025-05-07T20:33:19.2741382Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2741598Z self=, 2025-05-07T20:33:19.2741675Z T=16384, 2025-05-07T20:33:19.2741751Z D=5120, 2025-05-07T20:33:19.2741830Z scale_ub=1200.0, 2025-05-07T20:33:19.2741909Z contiguous=True, 2025-05-07T20:33:19.2741990Z compiled=True, 2025-05-07T20:33:19.2742157Z ) 2025-05-07T20:33:19.2742375Z self = 2025-05-07T20:33:19.2742549Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2742554Z 2025-05-07T20:33:19.2742630Z @given( 2025-05-07T20:33:19.2742742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2742837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2742950Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2743061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2743172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2743242Z ) 2025-05-07T20:33:19.2743479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2743576Z def test_silu_mul_quant( 2025-05-07T20:33:19.2743646Z self, 2025-05-07T20:33:19.2743719Z T: int, 2025-05-07T20:33:19.2743805Z D: int, 2025-05-07T20:33:19.2743900Z scale_ub: Optional[float], 2025-05-07T20:33:19.2744107Z contiguous: bool, 2025-05-07T20:33:19.2744197Z compiled: bool, 2025-05-07T20:33:19.2744278Z ) -> None: 2025-05-07T20:33:19.2744370Z torch.manual_seed(2025) 2025-05-07T20:33:19.2744443Z 2025-05-07T20:33:19.2744606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2744680Z 2025-05-07T20:33:19.2744768Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2744887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2744975Z x = x_sign * x_clamp 2025-05-07T20:33:19.2745051Z x0 = x[:, :D] 2025-05-07T20:33:19.2745128Z x1 = x[:, D:] 2025-05-07T20:33:19.2745197Z 2025-05-07T20:33:19.2745275Z if contiguous: 2025-05-07T20:33:19.2745429Z x0 = x0.contiguous() 2025-05-07T20:33:19.2745517Z x1 = x1.contiguous() 2025-05-07T20:33:19.2745586Z 2025-05-07T20:33:19.2745677Z if scale_ub is not None: 2025-05-07T20:33:19.2745787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2745920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2745999Z ) 2025-05-07T20:33:19.2746074Z else: 2025-05-07T20:33:19.2746166Z scale_ub_tensor = None 2025-05-07T20:33:19.2746245Z 2025-05-07T20:33:19.2746373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2746461Z op = silu_mul_quant 2025-05-07T20:33:19.2746547Z if compiled: 2025-05-07T20:33:19.2746642Z op = torch.compile(op) 2025-05-07T20:33:19.2746745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2746821Z 2025-05-07T20:33:19.2746909Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2746917Z 2025-05-07T20:33:19.2747009Z moe/activation_test.py:117: 2025-05-07T20:33:19.2747135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2747234Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2747339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2747792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2747883Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2748371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2748464Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2748817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2749034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2749369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2749460Z kernel = self.compile( 2025-05-07T20:33:19.2749906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2750077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2750205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2750210Z 2025-05-07T20:33:19.2750406Z self = 2025-05-07T20:33:19.2751168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2751655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9383380>} 2025-05-07T20:33:19.2752431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2752654Z context = 2025-05-07T20:33:19.2752659Z 2025-05-07T20:33:19.2752816Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2753073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2753176Z module_map=module_map) 2025-05-07T20:33:19.2753329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2753425Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2753498Z E ^ 2025-05-07T20:33:19.2753847Z E ValueError("type fp8e4nv not supported in this architecture. 
[The test body and traceback above repeat verbatim for the next six Hypothesis draws; only the drawn parameters change, and every draw fails in triton/compiler/compiler.py:100 with the same CompilationError:]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError: fp8e4nv not supported
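For context on what the failing kernel is asked to produce, here is a plausible eager-mode reference of silu_mul_quant, inferred only from the test's usage; the rowwise-scaling choice and the silu_mul_quant_ref name are assumptions, not FBGEMM's actual implementation. torch.float8_e4m3fn is the same format Triton calls fp8e4nv, which ties this back to the compilation failure:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: SiLU(x0) * x1, then rowwise float8 e4m3 quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            # Cap the per-row maximum so one outlier cannot inflate the scale.
            row_max = torch.clamp(row_max, max=scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
        y_scale = row_max / fp8_max  # nonzero here: the test clamps |x| >= 0.01
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale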
[Five more draws fail the same way. The two eager (compiled=False) draws differ only in that the torch/_dynamo/eval_frame.py frame is absent from the traceback: the Triton kernel is launched directly from fbgemm_gpu/experimental/gen_ai/moe/activation.py:80.]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError: fp8e4nv not supported
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2897956Z 2025-05-07T20:33:19.2898370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2898375Z 2025-05-07T20:33:19.2898473Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2898690Z self=, 2025-05-07T20:33:19.2898812Z T=16384, 2025-05-07T20:33:19.2898887Z D=5120, 2025-05-07T20:33:19.2898965Z scale_ub=None, 2025-05-07T20:33:19.2899053Z contiguous=False, 2025-05-07T20:33:19.2899141Z compiled=False, 2025-05-07T20:33:19.2899208Z ) 2025-05-07T20:33:19.2899433Z self = 2025-05-07T20:33:19.2899608Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2899612Z 2025-05-07T20:33:19.2899694Z @given( 2025-05-07T20:33:19.2899809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2899909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2900027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2900141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2900251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2900330Z ) 2025-05-07T20:33:19.2900570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2900669Z def test_silu_mul_quant( 2025-05-07T20:33:19.2900744Z self, 2025-05-07T20:33:19.2900817Z T: int, 2025-05-07T20:33:19.2900901Z D: int, 2025-05-07T20:33:19.2901001Z scale_ub: Optional[float], 2025-05-07T20:33:19.2901087Z contiguous: bool, 2025-05-07T20:33:19.2901179Z compiled: bool, 2025-05-07T20:33:19.2901258Z ) -> None: 2025-05-07T20:33:19.2901355Z torch.manual_seed(2025) 2025-05-07T20:33:19.2901434Z 2025-05-07T20:33:19.2901596Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2901666Z 2025-05-07T20:33:19.2901763Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2901885Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2903721Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2903734Z 2025-05-07T20:33:19.2903849Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.2903854Z 2025-05-07T20:33:19.2903957Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2904175Z self=, 2025-05-07T20:33:19.2904248Z T=4096, 2025-05-07T20:33:19.2904327Z D=7168, 2025-05-07T20:33:19.2904405Z scale_ub=1200.0, 2025-05-07T20:33:19.2904483Z contiguous=True, 2025-05-07T20:33:19.2904564Z compiled=True, 2025-05-07T20:33:19.2904631Z ) 2025-05-07T20:33:19.2904842Z self = 2025-05-07T20:33:19.2905010Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2905014Z 2025-05-07T20:33:19.2905088Z @given( 2025-05-07T20:33:19.2905287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2905383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2905492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2905606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2905714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2905785Z ) 2025-05-07T20:33:19.2906028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2906118Z def test_silu_mul_quant( 2025-05-07T20:33:19.2906190Z self, 2025-05-07T20:33:19.2906272Z T: int, 2025-05-07T20:33:19.2906344Z D: int, 2025-05-07T20:33:19.2906436Z scale_ub: Optional[float], 2025-05-07T20:33:19.2906642Z contiguous: bool, 2025-05-07T20:33:19.2906725Z compiled: bool, 2025-05-07T20:33:19.2906806Z ) -> None: 2025-05-07T20:33:19.2906899Z torch.manual_seed(2025) 2025-05-07T20:33:19.2906971Z 2025-05-07T20:33:19.2907145Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2907214Z 2025-05-07T20:33:19.2907301Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2907473Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2909233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2909242Z 2025-05-07T20:33:19.2909360Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.2909365Z 2025-05-07T20:33:19.2909466Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2909680Z self=, 2025-05-07T20:33:19.2909756Z T=16384, 2025-05-07T20:33:19.2909823Z D=7168, 2025-05-07T20:33:19.2909899Z scale_ub=None, 2025-05-07T20:33:19.2909980Z contiguous=False, 2025-05-07T20:33:19.2910059Z compiled=False, 2025-05-07T20:33:19.2910126Z ) 2025-05-07T20:33:19.2910335Z self = 2025-05-07T20:33:19.2910503Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2910508Z 2025-05-07T20:33:19.2910582Z @given( 2025-05-07T20:33:19.2910694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2910792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2910907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2911065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2911182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2911254Z ) 2025-05-07T20:33:19.2911491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2911584Z def test_silu_mul_quant( 2025-05-07T20:33:19.2911656Z self, 2025-05-07T20:33:19.2911730Z T: int, 2025-05-07T20:33:19.2911807Z D: int, 2025-05-07T20:33:19.2911899Z scale_ub: Optional[float], 2025-05-07T20:33:19.2911986Z contiguous: bool, 2025-05-07T20:33:19.2912075Z compiled: bool, 2025-05-07T20:33:19.2912149Z ) -> None: 2025-05-07T20:33:19.2912241Z torch.manual_seed(2025) 2025-05-07T20:33:19.2912315Z 2025-05-07T20:33:19.2912475Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2914287Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2914331Z 2025-05-07T20:33:19.2914448Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2914452Z 2025-05-07T20:33:19.2914551Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2914768Z self=, 2025-05-07T20:33:19.2914844Z T=2048, 2025-05-07T20:33:19.2914983Z D=7168, 2025-05-07T20:33:19.2915065Z scale_ub=1200.0, 2025-05-07T20:33:19.2915147Z contiguous=True, 2025-05-07T20:33:19.2915234Z compiled=True, 2025-05-07T20:33:19.2915312Z ) 2025-05-07T20:33:19.2915526Z self = 2025-05-07T20:33:19.2915696Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2915700Z 2025-05-07T20:33:19.2915772Z @given( 2025-05-07T20:33:19.2915890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2915999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2916120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2916262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2916369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2916443Z ) 2025-05-07T20:33:19.2916682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2916773Z def test_silu_mul_quant( 2025-05-07T20:33:19.2916846Z self, 2025-05-07T20:33:19.2916919Z T: int, 2025-05-07T20:33:19.2916990Z D: int, 2025-05-07T20:33:19.2917089Z scale_ub: Optional[float], 2025-05-07T20:33:19.2917179Z contiguous: bool, 2025-05-07T20:33:19.2917259Z compiled: bool, 2025-05-07T20:33:19.2917335Z ) -> None: 2025-05-07T20:33:19.2917428Z torch.manual_seed(2025) 2025-05-07T20:33:19.2917496Z 2025-05-07T20:33:19.2917665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2917735Z 2025-05-07T20:33:19.2917823Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2917945Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2919733Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2919744Z 2025-05-07T20:33:19.2919864Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.2919869Z 2025-05-07T20:33:19.2919965Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2920182Z self=, 2025-05-07T20:33:19.2920254Z T=2048, 2025-05-07T20:33:19.2920328Z D=7168, 2025-05-07T20:33:19.2920410Z scale_ub=None, 2025-05-07T20:33:19.2920489Z contiguous=True, 2025-05-07T20:33:19.2920570Z compiled=False, 2025-05-07T20:33:19.2920639Z ) 2025-05-07T20:33:19.2920848Z self = 2025-05-07T20:33:19.2921014Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2921018Z 2025-05-07T20:33:19.2921096Z @given( 2025-05-07T20:33:19.2921210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2921385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2921493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2921605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2921716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2921788Z ) 2025-05-07T20:33:19.2922028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2922121Z def test_silu_mul_quant( 2025-05-07T20:33:19.2922192Z self, 2025-05-07T20:33:19.2922266Z T: int, 2025-05-07T20:33:19.2922338Z D: int, 2025-05-07T20:33:19.2922429Z scale_ub: Optional[float], 2025-05-07T20:33:19.2922513Z contiguous: bool, 2025-05-07T20:33:19.2922641Z compiled: bool, 2025-05-07T20:33:19.2922717Z ) -> None: 2025-05-07T20:33:19.2922809Z torch.manual_seed(2025) 2025-05-07T20:33:19.2922879Z 2025-05-07T20:33:19.2923043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2923119Z 2025-05-07T20:33:19.2923204Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.2924942Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2924953Z 2025-05-07T20:33:19.2925063Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.2925068Z 2025-05-07T20:33:19.2925164Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2925385Z self=, 2025-05-07T20:33:19.2925460Z T=1, 2025-05-07T20:33:19.2925533Z D=7168, 2025-05-07T20:33:19.2925613Z scale_ub=1200.0, 2025-05-07T20:33:19.2925693Z contiguous=True, 2025-05-07T20:33:19.2925777Z compiled=False, 2025-05-07T20:33:19.2925848Z ) 2025-05-07T20:33:19.2926058Z self = 2025-05-07T20:33:19.2926219Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2926223Z 2025-05-07T20:33:19.2926294Z @given( 2025-05-07T20:33:19.2926404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2926501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2926608Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2926723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2926882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2926956Z ) 2025-05-07T20:33:19.2927205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2927294Z def test_silu_mul_quant( 2025-05-07T20:33:19.2927368Z self, 2025-05-07T20:33:19.2927448Z T: int, 2025-05-07T20:33:19.2927521Z D: int, 2025-05-07T20:33:19.2927613Z scale_ub: Optional[float], 2025-05-07T20:33:19.2927702Z contiguous: bool, 2025-05-07T20:33:19.2927781Z compiled: bool, 2025-05-07T20:33:19.2927855Z ) -> None: 2025-05-07T20:33:19.2927950Z torch.manual_seed(2025) 2025-05-07T20:33:19.2928020Z 2025-05-07T20:33:19.2928182Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2928260Z 2025-05-07T20:33:19.2928347Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2928475Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2928560Z x = x_sign * x_clamp 2025-05-07T20:33:19.2928638Z x0 = x[:, :D] 2025-05-07T20:33:19.2928716Z x1 = x[:, D:] 2025-05-07T20:33:19.2928866Z 2025-05-07T20:33:19.2928946Z if contiguous: 2025-05-07T20:33:19.2929035Z x0 = x0.contiguous() 2025-05-07T20:33:19.2929119Z x1 = x1.contiguous() 2025-05-07T20:33:19.2929191Z 2025-05-07T20:33:19.2929281Z if scale_ub is not None: 2025-05-07T20:33:19.2929379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2929508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2929585Z ) 2025-05-07T20:33:19.2929657Z else: 2025-05-07T20:33:19.2929747Z scale_ub_tensor = None 2025-05-07T20:33:19.2929818Z 2025-05-07T20:33:19.2929941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2930072Z op = silu_mul_quant 2025-05-07T20:33:19.2930153Z if compiled: 2025-05-07T20:33:19.2930249Z op = torch.compile(op) 2025-05-07T20:33:19.2930354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2930424Z 2025-05-07T20:33:19.2930512Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2930516Z 2025-05-07T20:33:19.2930612Z moe/activation_test.py:117: 2025-05-07T20:33:19.2930735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2930828Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2930922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2931415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2931510Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2931865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2932084Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2932429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2932520Z kernel = self.compile( 2025-05-07T20:33:19.2932905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2933074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2933198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2933202Z 2025-05-07T20:33:19.2933405Z self = 2025-05-07T20:33:19.2934169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2934714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bc162a0>} 2025-05-07T20:33:19.2935451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2935641Z context = 2025-05-07T20:33:19.2935645Z 2025-05-07T20:33:19.2935809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2936067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2936178Z module_map=module_map) 2025-05-07T20:33:19.2936334Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2936433Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2936516Z E ^ 2025-05-07T20:33:19.2936865Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2936911Z 2025-05-07T20:33:19.2937366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2937377Z 2025-05-07T20:33:19.2937478Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2937694Z self=, 2025-05-07T20:33:19.2937776Z T=128, 2025-05-07T20:33:19.2937852Z D=5120, 2025-05-07T20:33:19.2937930Z scale_ub=None, 2025-05-07T20:33:19.2938012Z contiguous=True, 2025-05-07T20:33:19.2938091Z compiled=False, 2025-05-07T20:33:19.2938157Z ) 2025-05-07T20:33:19.2938374Z self = 2025-05-07T20:33:19.2938583Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2938588Z 2025-05-07T20:33:19.2938664Z @given( 2025-05-07T20:33:19.2938779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2938878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2938990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2939102Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2939213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2939288Z ) 2025-05-07T20:33:19.2939526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2939615Z def test_silu_mul_quant( 2025-05-07T20:33:19.2939696Z self, 2025-05-07T20:33:19.2939771Z T: int, 2025-05-07T20:33:19.2939847Z D: int, 2025-05-07T20:33:19.2939943Z scale_ub: Optional[float], 2025-05-07T20:33:19.2940026Z contiguous: bool, 2025-05-07T20:33:19.2940489Z compiled: bool, 2025-05-07T20:33:19.2940601Z ) -> None: 2025-05-07T20:33:19.2940724Z torch.manual_seed(2025) 2025-05-07T20:33:19.2940825Z 2025-05-07T20:33:19.2941027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2941106Z 2025-05-07T20:33:19.2941200Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2944698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2944801Z x = x_sign * x_clamp 2025-05-07T20:33:19.2944882Z x0 = x[:, :D] 2025-05-07T20:33:19.2944961Z x1 = x[:, D:] 2025-05-07T20:33:19.2945030Z 2025-05-07T20:33:19.2945113Z if contiguous: 2025-05-07T20:33:19.2945202Z x0 = x0.contiguous() 2025-05-07T20:33:19.2945290Z x1 = x1.contiguous() 2025-05-07T20:33:19.2945363Z 2025-05-07T20:33:19.2945452Z if scale_ub is not None: 2025-05-07T20:33:19.2945559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2945691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2945773Z ) 2025-05-07T20:33:19.2945850Z else: 2025-05-07T20:33:19.2946077Z scale_ub_tensor = None 2025-05-07T20:33:19.2946157Z 2025-05-07T20:33:19.2946307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2946396Z op = silu_mul_quant 2025-05-07T20:33:19.2946478Z if compiled: 2025-05-07T20:33:19.2946581Z op = torch.compile(op) 2025-05-07T20:33:19.2946687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2946761Z 2025-05-07T20:33:19.2946849Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2946854Z 2025-05-07T20:33:19.2946948Z moe/activation_test.py:117: 2025-05-07T20:33:19.2947076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2947174Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2947269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2947855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2947954Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2948381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2948678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2949015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2949111Z kernel = self.compile( 2025-05-07T20:33:19.2949490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2949660Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2949787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2949853Z 2025-05-07T20:33:19.2950057Z self = 2025-05-07T20:33:19.2950825Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2951319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bc171a0>} 2025-05-07T20:33:19.2952053Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2952237Z context = 2025-05-07T20:33:19.2952242Z 2025-05-07T20:33:19.2952401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2952666Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2952774Z module_map=module_map) 2025-05-07T20:33:19.2952936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2953036Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2953110Z E ^ 2025-05-07T20:33:19.2953462Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2953467Z 2025-05-07T20:33:19.2953883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2953887Z 2025-05-07T20:33:19.2953986Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2954207Z self=, 2025-05-07T20:33:19.2954285Z T=128, 2025-05-07T20:33:19.2954364Z D=7168, 2025-05-07T20:33:19.2954444Z scale_ub=None, 2025-05-07T20:33:19.2954525Z contiguous=True, 2025-05-07T20:33:19.2954608Z compiled=False, 2025-05-07T20:33:19.2954724Z ) 2025-05-07T20:33:19.2954941Z self = 2025-05-07T20:33:19.2955112Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2955117Z 2025-05-07T20:33:19.2955191Z @given( 2025-05-07T20:33:19.2955304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2955405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2955514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2955630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2955740Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2955813Z ) 2025-05-07T20:33:19.2956055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2956148Z def test_silu_mul_quant( 2025-05-07T20:33:19.2956223Z self, 2025-05-07T20:33:19.2956306Z T: int, 2025-05-07T20:33:19.2956383Z D: int, 2025-05-07T20:33:19.2956476Z scale_ub: Optional[float], 2025-05-07T20:33:19.2956646Z contiguous: bool, 2025-05-07T20:33:19.2956729Z compiled: bool, 2025-05-07T20:33:19.2956807Z ) -> None: 2025-05-07T20:33:19.2956901Z torch.manual_seed(2025) 2025-05-07T20:33:19.2956973Z 2025-05-07T20:33:19.2957134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2957212Z 2025-05-07T20:33:19.2957300Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2957422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2957507Z x = x_sign * x_clamp 2025-05-07T20:33:19.2957581Z x0 = x[:, :D] 2025-05-07T20:33:19.2957657Z x1 = x[:, D:] 2025-05-07T20:33:19.2957726Z 2025-05-07T20:33:19.2957806Z if contiguous: 2025-05-07T20:33:19.2957938Z x0 = x0.contiguous() 2025-05-07T20:33:19.2958022Z x1 = x1.contiguous() 2025-05-07T20:33:19.2958087Z 2025-05-07T20:33:19.2958179Z if scale_ub is not None: 2025-05-07T20:33:19.2958287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2958419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2958495Z ) 2025-05-07T20:33:19.2958569Z else: 2025-05-07T20:33:19.2958660Z scale_ub_tensor = None 2025-05-07T20:33:19.2958729Z 2025-05-07T20:33:19.2958852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2958937Z op = silu_mul_quant 2025-05-07T20:33:19.2959017Z if compiled: 2025-05-07T20:33:19.2959112Z op = torch.compile(op) 2025-05-07T20:33:19.2959215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2959284Z 2025-05-07T20:33:19.2959373Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2959381Z 2025-05-07T20:33:19.2959479Z moe/activation_test.py:117: 2025-05-07T20:33:19.2959604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2959708Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2959808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2960295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2960397Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2960750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2960967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2961308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2961400Z kernel = self.compile( 2025-05-07T20:33:19.2961787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2962001Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2962129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2962133Z 2025-05-07T20:33:19.2962337Z self = 2025-05-07T20:33:19.2963102Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2963593Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359be10040>} 2025-05-07T20:33:19.2964324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2964515Z context = 2025-05-07T20:33:19.2964604Z 2025-05-07T20:33:19.2964763Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2965020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2965128Z module_map=module_map) 2025-05-07T20:33:19.2965284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2965379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2965459Z E ^ 2025-05-07T20:33:19.2965806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2965811Z 2025-05-07T20:33:19.2966228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2966271Z 2025-05-07T20:33:19.2966369Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2966590Z self=, 2025-05-07T20:33:19.2966671Z T=2048, 2025-05-07T20:33:19.2966743Z D=7168, 2025-05-07T20:33:19.2966822Z scale_ub=1200.0, 2025-05-07T20:33:19.2966906Z contiguous=True, 2025-05-07T20:33:19.2966984Z compiled=False, 2025-05-07T20:33:19.2967053Z ) 2025-05-07T20:33:19.2967267Z self = 2025-05-07T20:33:19.2967435Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2967440Z 2025-05-07T20:33:19.2967515Z @given( 2025-05-07T20:33:19.2967626Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2967722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2967837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2967952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2968064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2968138Z ) 2025-05-07T20:33:19.2968379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2968468Z def test_silu_mul_quant( 2025-05-07T20:33:19.2968545Z self, 2025-05-07T20:33:19.2968619Z T: int, 2025-05-07T20:33:19.2968699Z D: int, 2025-05-07T20:33:19.2968793Z scale_ub: Optional[float], 2025-05-07T20:33:19.2968876Z contiguous: bool, 2025-05-07T20:33:19.2968959Z compiled: bool, 2025-05-07T20:33:19.2969031Z ) -> None: 2025-05-07T20:33:19.2969122Z torch.manual_seed(2025) 2025-05-07T20:33:19.2969195Z 2025-05-07T20:33:19.2969356Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2971167Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2971178Z 2025-05-07T20:33:19.2971290Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2971295Z 2025-05-07T20:33:19.2971389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2971606Z self=, 2025-05-07T20:33:19.2971679Z T=1, 2025-05-07T20:33:19.2971753Z D=5120, 2025-05-07T20:33:19.2971829Z scale_ub=1200.0, 2025-05-07T20:33:19.2971905Z contiguous=True, 2025-05-07T20:33:19.2971987Z compiled=False, 2025-05-07T20:33:19.2972056Z ) 2025-05-07T20:33:19.2972267Z self = 2025-05-07T20:33:19.2972469Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2972509Z 2025-05-07T20:33:19.2972583Z @given( 2025-05-07T20:33:19.2972693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2972788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2972895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2973009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2973122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2973194Z ) 2025-05-07T20:33:19.2973434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2973521Z def test_silu_mul_quant( 2025-05-07T20:33:19.2973592Z self, 2025-05-07T20:33:19.2973709Z T: int, 2025-05-07T20:33:19.2973781Z D: int, 2025-05-07T20:33:19.2973874Z scale_ub: Optional[float], 2025-05-07T20:33:19.2973961Z contiguous: bool, 2025-05-07T20:33:19.2974042Z compiled: bool, 2025-05-07T20:33:19.2974119Z ) -> None: 2025-05-07T20:33:19.2974218Z torch.manual_seed(2025) 2025-05-07T20:33:19.2974288Z 2025-05-07T20:33:19.2974457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2974528Z 2025-05-07T20:33:19.2974619Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2974742Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2974829Z x = x_sign * x_clamp 2025-05-07T20:33:19.2974908Z x0 = x[:, :D] 2025-05-07T20:33:19.2974990Z x1 = x[:, D:] 2025-05-07T20:33:19.2975057Z 2025-05-07T20:33:19.2975136Z if contiguous: 2025-05-07T20:33:19.2975230Z x0 = x0.contiguous() 2025-05-07T20:33:19.2975317Z x1 = x1.contiguous() 2025-05-07T20:33:19.2975389Z 2025-05-07T20:33:19.2975478Z if scale_ub is not None: 2025-05-07T20:33:19.2975580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2975712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2975790Z ) 2025-05-07T20:33:19.2975864Z else: 2025-05-07T20:33:19.2975957Z scale_ub_tensor = None 2025-05-07T20:33:19.2976026Z 2025-05-07T20:33:19.2976172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2976267Z op = silu_mul_quant 2025-05-07T20:33:19.2976369Z if compiled: 2025-05-07T20:33:19.2976466Z op = torch.compile(op) 2025-05-07T20:33:19.2976570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2976636Z 2025-05-07T20:33:19.2976724Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2976729Z 2025-05-07T20:33:19.2976824Z moe/activation_test.py:117: 2025-05-07T20:33:19.2976947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2977047Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2977140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2977675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2977774Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2978127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2978343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2978683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2978773Z kernel = self.compile( 2025-05-07T20:33:19.2979175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2979346Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2979469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2979474Z 2025-05-07T20:33:19.2979773Z self = 2025-05-07T20:33:19.2980536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2981030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359be11580>} 2025-05-07T20:33:19.2981756Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2981981Z context = 2025-05-07T20:33:19.2981990Z 2025-05-07T20:33:19.2982149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2982409Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2982515Z module_map=module_map) 2025-05-07T20:33:19.2982671Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2982764Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2982840Z E ^ 2025-05-07T20:33:19.2983186Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2983191Z 2025-05-07T20:33:19.2983602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2983606Z 2025-05-07T20:33:19.2983707Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2983921Z self=, 2025-05-07T20:33:19.2984005Z T=2048, 2025-05-07T20:33:19.2984078Z D=5120, 2025-05-07T20:33:19.2984161Z scale_ub=None, 2025-05-07T20:33:19.2984245Z contiguous=True, 2025-05-07T20:33:19.2984326Z compiled=False, 2025-05-07T20:33:19.2984395Z ) 2025-05-07T20:33:19.2984612Z self = 2025-05-07T20:33:19.2984779Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2984783Z 2025-05-07T20:33:19.2984861Z @given( 2025-05-07T20:33:19.2984974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2985068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2985178Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2985289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2985399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2985474Z ) 2025-05-07T20:33:19.2985755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2985850Z def test_silu_mul_quant( 2025-05-07T20:33:19.2985929Z self, 2025-05-07T20:33:19.2986003Z T: int, 2025-05-07T20:33:19.2986079Z D: int, 2025-05-07T20:33:19.2986171Z scale_ub: Optional[float], 2025-05-07T20:33:19.2986254Z contiguous: bool, 2025-05-07T20:33:19.2986335Z compiled: bool, 2025-05-07T20:33:19.2986409Z ) -> None: 2025-05-07T20:33:19.2986495Z torch.manual_seed(2025) 2025-05-07T20:33:19.2986567Z 2025-05-07T20:33:19.2986727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2986796Z 2025-05-07T20:33:19.2986883Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.2988730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2988773Z 2025-05-07T20:33:19.2988890Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.2988895Z 2025-05-07T20:33:19.2988995Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2989215Z self=, 2025-05-07T20:33:19.2989288Z T=16384, 2025-05-07T20:33:19.2989362Z D=5120, 2025-05-07T20:33:19.2989441Z scale_ub=None, 2025-05-07T20:33:19.2989522Z contiguous=True, 2025-05-07T20:33:19.2989602Z compiled=False, 2025-05-07T20:33:19.2989719Z ) 2025-05-07T20:33:19.2989932Z self = 2025-05-07T20:33:19.2990108Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2990114Z 2025-05-07T20:33:19.2990193Z @given( 2025-05-07T20:33:19.2990306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2990400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2990509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2990618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2990730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2990801Z ) 2025-05-07T20:33:19.2991041Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2991133Z def test_silu_mul_quant( 2025-05-07T20:33:19.2991206Z self, 2025-05-07T20:33:19.2991281Z T: int, 2025-05-07T20:33:19.2991360Z D: int, 2025-05-07T20:33:19.2991452Z scale_ub: Optional[float], 2025-05-07T20:33:19.2991535Z contiguous: bool, 2025-05-07T20:33:19.2991623Z compiled: bool, 2025-05-07T20:33:19.2991697Z ) -> None: 2025-05-07T20:33:19.2991793Z torch.manual_seed(2025) 2025-05-07T20:33:19.2991860Z 2025-05-07T20:33:19.2992020Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2993771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2993779Z 2025-05-07T20:33:19.2993890Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2993895Z 2025-05-07T20:33:19.2994038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2994259Z self=, 2025-05-07T20:33:19.2994332Z T=4096, 2025-05-07T20:33:19.2994408Z D=5120, 2025-05-07T20:33:19.2994484Z scale_ub=None, 2025-05-07T20:33:19.2994564Z contiguous=True, 2025-05-07T20:33:19.2994644Z compiled=False, 2025-05-07T20:33:19.2994713Z ) 2025-05-07T20:33:19.2994925Z self = 2025-05-07T20:33:19.2995087Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2995092Z 2025-05-07T20:33:19.2995161Z @given( 2025-05-07T20:33:19.2995276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2995370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2995480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2995593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2995701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2995813Z ) 2025-05-07T20:33:19.2996091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2996180Z def test_silu_mul_quant( 2025-05-07T20:33:19.2996254Z self, 2025-05-07T20:33:19.2996325Z T: int, 2025-05-07T20:33:19.2996395Z D: int, 2025-05-07T20:33:19.2996490Z scale_ub: Optional[float], 2025-05-07T20:33:19.2996574Z contiguous: bool, 2025-05-07T20:33:19.2996654Z compiled: bool, 2025-05-07T20:33:19.2996730Z ) -> None: 2025-05-07T20:33:19.2996822Z torch.manual_seed(2025) 2025-05-07T20:33:19.2996891Z 2025-05-07T20:33:19.2997054Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2998840Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2998848Z 2025-05-07T20:33:19.2998961Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2998966Z 2025-05-07T20:33:19.2999063Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2999280Z self=, 2025-05-07T20:33:19.2999355Z T=2048, 2025-05-07T20:33:19.2999429Z D=5120, 2025-05-07T20:33:19.2999509Z scale_ub=None, 2025-05-07T20:33:19.2999591Z contiguous=False, 2025-05-07T20:33:19.2999669Z compiled=False, 2025-05-07T20:33:19.2999742Z ) 2025-05-07T20:33:19.2999954Z self = 2025-05-07T20:33:19.3000123Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.3000128Z 2025-05-07T20:33:19.3000201Z @given( 2025-05-07T20:33:19.3000314Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3000409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3000517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3000627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3000737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3000808Z ) 2025-05-07T20:33:19.3001044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3001134Z def test_silu_mul_quant( 2025-05-07T20:33:19.3001210Z self, 2025-05-07T20:33:19.3001281Z T: int, 2025-05-07T20:33:19.3001355Z D: int, 2025-05-07T20:33:19.3001447Z scale_ub: Optional[float], 2025-05-07T20:33:19.3001995Z contiguous: bool, 2025-05-07T20:33:19.3002083Z compiled: bool, 2025-05-07T20:33:19.3002161Z ) -> None: 2025-05-07T20:33:19.3002251Z torch.manual_seed(2025) 2025-05-07T20:33:19.3002318Z 2025-05-07T20:33:19.3002478Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3004213Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3004222Z 2025-05-07T20:33:19.3004336Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3004341Z 2025-05-07T20:33:19.3004523Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3004737Z self=, 2025-05-07T20:33:19.3004807Z T=4096, 2025-05-07T20:33:19.3004882Z D=7168, 2025-05-07T20:33:19.3004959Z scale_ub=None, 2025-05-07T20:33:19.3005038Z contiguous=True, 2025-05-07T20:33:19.3005124Z compiled=True, 2025-05-07T20:33:19.3005195Z ) 2025-05-07T20:33:19.3005410Z self = 2025-05-07T20:33:19.3005570Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.3005575Z 2025-05-07T20:33:19.3005645Z @given( 2025-05-07T20:33:19.3005760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3005898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3006008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3006131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3006244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3006315Z ) 2025-05-07T20:33:19.3006556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3006646Z def test_silu_mul_quant( 2025-05-07T20:33:19.3006728Z self, 2025-05-07T20:33:19.3006802Z T: int, 2025-05-07T20:33:19.3006875Z D: int, 2025-05-07T20:33:19.3006975Z scale_ub: Optional[float], 2025-05-07T20:33:19.3007061Z contiguous: bool, 2025-05-07T20:33:19.3007142Z compiled: bool, 2025-05-07T20:33:19.3007224Z ) -> None: 2025-05-07T20:33:19.3007317Z torch.manual_seed(2025) 2025-05-07T20:33:19.3007389Z 2025-05-07T20:33:19.3007551Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3009302Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3009311Z 2025-05-07T20:33:19.3009429Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3009433Z 2025-05-07T20:33:19.3009531Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3009751Z self=, 2025-05-07T20:33:19.3009825Z T=2048, 2025-05-07T20:33:19.3009905Z D=5120, 2025-05-07T20:33:19.3009987Z scale_ub=1200.0, 2025-05-07T20:33:19.3010069Z contiguous=False, 2025-05-07T20:33:19.3010149Z compiled=False, 2025-05-07T20:33:19.3010291Z ) 2025-05-07T20:33:19.3010507Z self = 2025-05-07T20:33:19.3010679Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.3010684Z 2025-05-07T20:33:19.3010763Z @given( 2025-05-07T20:33:19.3010878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3010977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3011088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3011200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3011314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3011387Z ) 2025-05-07T20:33:19.3011623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3011717Z def test_silu_mul_quant( 2025-05-07T20:33:19.3011791Z self, 2025-05-07T20:33:19.3011866Z T: int, 2025-05-07T20:33:19.3011942Z D: int, 2025-05-07T20:33:19.3012040Z scale_ub: Optional[float], 2025-05-07T20:33:19.3012204Z contiguous: bool, 2025-05-07T20:33:19.3012292Z compiled: bool, 2025-05-07T20:33:19.3012367Z ) -> None: 2025-05-07T20:33:19.3012459Z torch.manual_seed(2025) 2025-05-07T20:33:19.3012528Z 2025-05-07T20:33:19.3012691Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3014436Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3014483Z 2025-05-07T20:33:19.3014598Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3014605Z 2025-05-07T20:33:19.3014705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3014919Z self=, 2025-05-07T20:33:19.3014993Z T=4096, 2025-05-07T20:33:19.3015071Z D=7168, 2025-05-07T20:33:19.3015149Z scale_ub=1200.0, 2025-05-07T20:33:19.3015229Z contiguous=True, 2025-05-07T20:33:19.3015314Z compiled=False, 2025-05-07T20:33:19.3015384Z ) 2025-05-07T20:33:19.3015599Z self = 2025-05-07T20:33:19.3015764Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3015768Z 2025-05-07T20:33:19.3015841Z @given( 2025-05-07T20:33:19.3015961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3016058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3016169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3016293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3016403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3016475Z ) 2025-05-07T20:33:19.3016714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3016804Z def test_silu_mul_quant( 2025-05-07T20:33:19.3016882Z self, 2025-05-07T20:33:19.3016954Z T: int, 2025-05-07T20:33:19.3017025Z D: int, 2025-05-07T20:33:19.3017121Z scale_ub: Optional[float], 2025-05-07T20:33:19.3017205Z contiguous: bool, 2025-05-07T20:33:19.3017288Z compiled: bool, 2025-05-07T20:33:19.3017365Z ) -> None: 2025-05-07T20:33:19.3017455Z torch.manual_seed(2025) 2025-05-07T20:33:19.3017526Z 2025-05-07T20:33:19.3017687Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3019472Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3019481Z 2025-05-07T20:33:19.3019595Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3019599Z 2025-05-07T20:33:19.3019696Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3019916Z self=, 2025-05-07T20:33:19.3019994Z T=16384, 2025-05-07T20:33:19.3020070Z D=7168, 2025-05-07T20:33:19.3020150Z scale_ub=None, 2025-05-07T20:33:19.3020235Z contiguous=False, 2025-05-07T20:33:19.3020318Z compiled=True, 2025-05-07T20:33:19.3020428Z ) 2025-05-07T20:33:19.3020674Z self = 2025-05-07T20:33:19.3020845Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.3020850Z 2025-05-07T20:33:19.3020926Z @given( 2025-05-07T20:33:19.3021037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3021133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3021242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3021352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3021463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3021533Z ) 2025-05-07T20:33:19.3021769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3021905Z def test_silu_mul_quant( 2025-05-07T20:33:19.3021980Z self, 2025-05-07T20:33:19.3022054Z T: int, 2025-05-07T20:33:19.3022133Z D: int, 2025-05-07T20:33:19.3022226Z scale_ub: Optional[float], 2025-05-07T20:33:19.3022310Z contiguous: bool, 2025-05-07T20:33:19.3022397Z compiled: bool, 2025-05-07T20:33:19.3022471Z ) -> None: 2025-05-07T20:33:19.3022566Z torch.manual_seed(2025) 2025-05-07T20:33:19.3022633Z 2025-05-07T20:33:19.3022794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3024537Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3024550Z 2025-05-07T20:33:19.3024661Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3024665Z 2025-05-07T20:33:19.3024768Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3024982Z self=, 2025-05-07T20:33:19.3025056Z T=4096, 2025-05-07T20:33:19.3025135Z D=7168, 2025-05-07T20:33:19.3025213Z scale_ub=None, 2025-05-07T20:33:19.3025294Z contiguous=True, 2025-05-07T20:33:19.3025379Z compiled=False, 2025-05-07T20:33:19.3025448Z ) 2025-05-07T20:33:19.3025659Z self = 2025-05-07T20:33:19.3025826Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.3025833Z 2025-05-07T20:33:19.3025905Z @given( 2025-05-07T20:33:19.3026021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3026158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3026272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3026387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3026494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3026566Z ) 2025-05-07T20:33:19.3026807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3026897Z def test_silu_mul_quant( 2025-05-07T20:33:19.3026973Z self, 2025-05-07T20:33:19.3027049Z T: int, 2025-05-07T20:33:19.3027124Z D: int, 2025-05-07T20:33:19.3027222Z scale_ub: Optional[float], 2025-05-07T20:33:19.3027309Z contiguous: bool, 2025-05-07T20:33:19.3027389Z compiled: bool, 2025-05-07T20:33:19.3027512Z ) -> None: 2025-05-07T20:33:19.3027604Z torch.manual_seed(2025) 2025-05-07T20:33:19.3027675Z 2025-05-07T20:33:19.3027843Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3029623Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3029666Z 2025-05-07T20:33:19.3029782Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3029786Z 2025-05-07T20:33:19.3029881Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3030139Z self=, 2025-05-07T20:33:19.3030214Z T=16384, 2025-05-07T20:33:19.3030285Z D=7168, 2025-05-07T20:33:19.3030371Z scale_ub=None, 2025-05-07T20:33:19.3030456Z contiguous=True, 2025-05-07T20:33:19.3030538Z compiled=False, 2025-05-07T20:33:19.3030612Z ) 2025-05-07T20:33:19.3030820Z self = 2025-05-07T20:33:19.3030988Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.3030992Z 2025-05-07T20:33:19.3031066Z @given( 2025-05-07T20:33:19.3031178Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3031275Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3031384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3031496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3031608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3031680Z ) 2025-05-07T20:33:19.3031918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3032016Z def test_silu_mul_quant( 2025-05-07T20:33:19.3032091Z self, 2025-05-07T20:33:19.3032169Z T: int, 2025-05-07T20:33:19.3032247Z D: int, 2025-05-07T20:33:19.3032340Z scale_ub: Optional[float], 2025-05-07T20:33:19.3032424Z contiguous: bool, 2025-05-07T20:33:19.3032510Z compiled: bool, 2025-05-07T20:33:19.3032583Z ) -> None: 2025-05-07T20:33:19.3032673Z torch.manual_seed(2025) 2025-05-07T20:33:19.3032742Z 2025-05-07T20:33:19.3032901Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3034690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3034701Z 2025-05-07T20:33:19.3034813Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3034818Z 2025-05-07T20:33:19.3034924Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3035141Z self=, 2025-05-07T20:33:19.3035213Z T=16384, 2025-05-07T20:33:19.3035293Z D=7168, 2025-05-07T20:33:19.3035370Z scale_ub=1200.0, 2025-05-07T20:33:19.3035453Z contiguous=True, 2025-05-07T20:33:19.3035537Z compiled=False, 2025-05-07T20:33:19.3035606Z ) 2025-05-07T20:33:19.3035819Z self = 2025-05-07T20:33:19.3036007Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3036013Z 2025-05-07T20:33:19.3036089Z @given( 2025-05-07T20:33:19.3036233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3036482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3036592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3036710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3036819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3036890Z ) 2025-05-07T20:33:19.3037132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3037222Z def test_silu_mul_quant( 2025-05-07T20:33:19.3037299Z self, 2025-05-07T20:33:19.3037373Z T: int, 2025-05-07T20:33:19.3037449Z D: int, 2025-05-07T20:33:19.3037547Z scale_ub: Optional[float], 2025-05-07T20:33:19.3037630Z contiguous: bool, 2025-05-07T20:33:19.3037778Z compiled: bool, 2025-05-07T20:33:19.3037858Z ) -> None: 2025-05-07T20:33:19.3037948Z torch.manual_seed(2025) 2025-05-07T20:33:19.3038020Z 2025-05-07T20:33:19.3038193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3039933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
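Note: the "Tried to allocate" sizes track the test's input shape exactly. x is a [T, 2 * D] bfloat16 tensor, so the T=16384, D=7168 examples above ask for 448.00 MiB and the T=4096 ones for 112.00 MiB. A standalone sketch checking that arithmetic:

    # Sketch: reproduce the OOM allocation sizes from the tensor shape.
    def x_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # bfloat16 = 2 bytes per element

    assert x_mib(16384, 7168) == 448.0  # matches "Tried to allocate 448.00 MiB"
    assert x_mib(4096, 7168) == 112.0   # matches "Tried to allocate 112.00 MiB"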
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3039939Z 2025-05-07T20:33:19.3040385Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3040401Z 2025-05-07T20:33:19.3040538Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3040763Z self=, 2025-05-07T20:33:19.3040843Z T=128, 2025-05-07T20:33:19.3040920Z D=5120, 2025-05-07T20:33:19.3041012Z scale_ub=1200.0, 2025-05-07T20:33:19.3041097Z contiguous=False, 2025-05-07T20:33:19.3041180Z compiled=False, 2025-05-07T20:33:19.3041257Z ) 2025-05-07T20:33:19.3041470Z self = 2025-05-07T20:33:19.3041637Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.3041641Z 2025-05-07T20:33:19.3041723Z @given( 2025-05-07T20:33:19.3041838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3041938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3042048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3042162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3042278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3042351Z ) 2025-05-07T20:33:19.3042680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3042781Z def test_silu_mul_quant( 2025-05-07T20:33:19.3042856Z self, 2025-05-07T20:33:19.3042931Z T: int, 2025-05-07T20:33:19.3043010Z D: int, 2025-05-07T20:33:19.3043105Z scale_ub: Optional[float], 2025-05-07T20:33:19.3043192Z contiguous: bool, 2025-05-07T20:33:19.3043281Z compiled: bool, 2025-05-07T20:33:19.3043357Z ) -> None: 2025-05-07T20:33:19.3043454Z torch.manual_seed(2025) 2025-05-07T20:33:19.3043526Z 2025-05-07T20:33:19.3043688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3043763Z 2025-05-07T20:33:19.3043854Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3043976Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3044067Z x = x_sign * x_clamp 2025-05-07T20:33:19.3044145Z x0 = x[:, :D] 2025-05-07T20:33:19.3044223Z x1 = x[:, D:] 2025-05-07T20:33:19.3044304Z 2025-05-07T20:33:19.3044388Z if contiguous: 2025-05-07T20:33:19.3044593Z x0 = x0.contiguous() 2025-05-07T20:33:19.3044684Z x1 = x1.contiguous() 2025-05-07T20:33:19.3044755Z 2025-05-07T20:33:19.3044851Z if scale_ub is not None: 2025-05-07T20:33:19.3044952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.3045085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.3045164Z ) 2025-05-07T20:33:19.3045238Z else: 2025-05-07T20:33:19.3045331Z scale_ub_tensor = None 2025-05-07T20:33:19.3045408Z 2025-05-07T20:33:19.3045536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.3045620Z op = silu_mul_quant 2025-05-07T20:33:19.3045705Z if compiled: 2025-05-07T20:33:19.3045866Z op = torch.compile(op) 2025-05-07T20:33:19.3045967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3046041Z 2025-05-07T20:33:19.3046133Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.3046140Z 2025-05-07T20:33:19.3046238Z moe/activation_test.py:117: 2025-05-07T20:33:19.3046364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.3046464Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.3046565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3047061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.3047155Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:19.3047517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:19.3047738Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:19.3048086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:19.3048181Z     kernel = self.compile(
2025-05-07T20:33:19.3048585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:19.3048764Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:19.3048889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:19.3048894Z
2025-05-07T20:33:19.3049098Z self =
2025-05-07T20:33:19.3049871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:19.3050365Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bbf11c0>}
2025-05-07T20:33:19.3051195Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:19.3051387Z context =
2025-05-07T20:33:19.3051392Z
2025-05-07T20:33:19.3051552Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:19.3051809Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:19.3051912Z                            module_map=module_map)
2025-05-07T20:33:19.3052074Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.3052167Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:19.3052242Z E       ^
2025-05-07T20:33:19.3052596Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.3052600Z 2025-05-07T20:33:19.3053089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.3053132Z 2025-05-07T20:33:19.3053239Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3053457Z self=, 2025-05-07T20:33:19.3053531Z T=2048, 2025-05-07T20:33:19.3053607Z D=7168, 2025-05-07T20:33:19.3053685Z scale_ub=None, 2025-05-07T20:33:19.3053774Z contiguous=False, 2025-05-07T20:33:19.3053854Z compiled=False, 2025-05-07T20:33:19.3053923Z ) 2025-05-07T20:33:19.3054142Z self = 2025-05-07T20:33:19.3054313Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.3054359Z 2025-05-07T20:33:19.3054433Z @given( 2025-05-07T20:33:19.3054551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3054652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3054763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3054885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3054994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3055071Z ) 2025-05-07T20:33:19.3055309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3055400Z def test_silu_mul_quant( 2025-05-07T20:33:19.3055479Z self, 2025-05-07T20:33:19.3055560Z T: int, 2025-05-07T20:33:19.3055636Z D: int, 2025-05-07T20:33:19.3055738Z scale_ub: Optional[float], 2025-05-07T20:33:19.3055825Z contiguous: bool, 2025-05-07T20:33:19.3055905Z compiled: bool, 2025-05-07T20:33:19.3055985Z ) -> None: 2025-05-07T20:33:19.3056080Z torch.manual_seed(2025) 2025-05-07T20:33:19.3056150Z 2025-05-07T20:33:19.3056317Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3058079Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
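Note: the CompilationError above is a different failure from the OOMs. Triton's ValueError("type fp8e4nv not supported in this architecture") means the FP8 E4M3 dtype is unavailable on this GPU: the g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), while this Triton build accepts fp8e4nv only on 8.9 or newer (Ada/Hopper), hence the ('fp8e4b15', 'fp8e5') alternatives in the message. A hedged sketch of a capability guard follows; the gating shown is illustrative, and since the rerun later in this log also fails inside the reference path (_kernel_quantize_fp8_row), a real guard would need to cover both the kernel under test and the reference quantization:

    # Sketch: skip FP8 E4M3 tests on GPUs where Triton rejects fp8e4nv.
    import unittest
    import torch

    def has_fp8e4nv() -> bool:
        # Triton's NVIDIA backend needs sm_89+ for E4M3; the A10G is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(has_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...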
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3058087Z 2025-05-07T20:33:19.3058201Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3058206Z 2025-05-07T20:33:19.3058303Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3058526Z self=, 2025-05-07T20:33:19.3058603Z T=128, 2025-05-07T20:33:19.3058676Z D=7168, 2025-05-07T20:33:19.3058759Z scale_ub=1200.0, 2025-05-07T20:33:19.3058883Z contiguous=True, 2025-05-07T20:33:19.3058970Z compiled=True, 2025-05-07T20:33:19.3059046Z ) 2025-05-07T20:33:19.3059258Z self = 2025-05-07T20:33:19.3059420Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.3059427Z 2025-05-07T20:33:19.3059502Z @given( 2025-05-07T20:33:19.3059616Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3059713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3059824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3059936Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3060050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3060122Z ) 2025-05-07T20:33:19.3060362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3060458Z def test_silu_mul_quant( 2025-05-07T20:33:19.3060538Z self, 2025-05-07T20:33:19.3060614Z T: int, 2025-05-07T20:33:19.3060777Z D: int, 2025-05-07T20:33:19.3060873Z scale_ub: Optional[float], 2025-05-07T20:33:19.3060965Z contiguous: bool, 2025-05-07T20:33:19.3061052Z compiled: bool, 2025-05-07T20:33:19.3061127Z ) -> None: 2025-05-07T20:33:19.3061220Z torch.manual_seed(2025) 2025-05-07T20:33:19.3061292Z 2025-05-07T20:33:19.3061454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3061530Z 2025-05-07T20:33:19.3061619Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3061741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3061831Z x = x_sign * x_clamp 2025-05-07T20:33:19.3061909Z x0 = x[:, :D] 2025-05-07T20:33:19.3062031Z x1 = x[:, D:] 2025-05-07T20:33:19.3062105Z 2025-05-07T20:33:19.3062188Z if contiguous: 2025-05-07T20:33:19.3062280Z x0 = x0.contiguous() 2025-05-07T20:33:19.3062374Z x1 = x1.contiguous() 2025-05-07T20:33:19.3062446Z 2025-05-07T20:33:19.3062539Z if scale_ub is not None: 2025-05-07T20:33:19.3062642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.3062774Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.3062850Z ) 2025-05-07T20:33:19.3062925Z else: 2025-05-07T20:33:19.3063016Z scale_ub_tensor = None 2025-05-07T20:33:19.3063089Z 2025-05-07T20:33:19.3063216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.3063303Z op = silu_mul_quant 2025-05-07T20:33:19.3063390Z if compiled: 2025-05-07T20:33:19.3067018Z op = torch.compile(op) 2025-05-07T20:33:19.3067136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3067217Z 2025-05-07T20:33:19.3067308Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.3067313Z 2025-05-07T20:33:19.3067473Z moe/activation_test.py:117: 2025-05-07T20:33:19.3067606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.3067709Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.3067808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3068174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.3068266Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.3068757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.3068852Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.3069213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.3069436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.3069841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.3069942Z kernel = self.compile( 2025-05-07T20:33:19.3070328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.3070499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.3070631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.3070636Z 2025-05-07T20:33:19.3070840Z self = 2025-05-07T20:33:19.3071613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.3072113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359b85fb00>} 2025-05-07T20:33:19.3072951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.3073142Z context = 2025-05-07T20:33:19.3073147Z 2025-05-07T20:33:19.3073307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.3073569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.3073673Z module_map=module_map) 2025-05-07T20:33:19.3073833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.3073928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.3074043Z E ^ 2025-05-07T20:33:19.3074397Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.3074404Z 2025-05-07T20:33:19.3074838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.3074842Z 2025-05-07T20:33:19.3074949Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3075164Z self=, 2025-05-07T20:33:19.3075241Z T=128, 2025-05-07T20:33:19.3075324Z D=7168, 2025-05-07T20:33:19.3075408Z scale_ub=1200.0, 2025-05-07T20:33:19.3075494Z contiguous=True, 2025-05-07T20:33:19.3075579Z compiled=False, 2025-05-07T20:33:19.3075652Z ) 2025-05-07T20:33:19.3075866Z self = 2025-05-07T20:33:19.3076039Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3076047Z 2025-05-07T20:33:19.3076118Z @given( 2025-05-07T20:33:19.3076236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3076339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3076457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3076575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3076686Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3076758Z ) 2025-05-07T20:33:19.3077003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3077093Z def test_silu_mul_quant( 2025-05-07T20:33:19.3077167Z self, 2025-05-07T20:33:19.3077247Z T: int, 2025-05-07T20:33:19.3077321Z D: int, 2025-05-07T20:33:19.3077415Z scale_ub: Optional[float], 2025-05-07T20:33:19.3077506Z contiguous: bool, 2025-05-07T20:33:19.3077588Z compiled: bool, 2025-05-07T20:33:19.3077669Z ) -> None: 2025-05-07T20:33:19.3077760Z torch.manual_seed(2025) 2025-05-07T20:33:19.3077827Z 2025-05-07T20:33:19.3078043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3078120Z 2025-05-07T20:33:19.3078211Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3078337Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3080083Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3080092Z 2025-05-07T20:33:19.3080210Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.3080215Z 2025-05-07T20:33:19.3080316Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3080575Z self=, 2025-05-07T20:33:19.3080691Z T=128, 2025-05-07T20:33:19.3080767Z D=5120, 2025-05-07T20:33:19.3080853Z scale_ub=1200.0, 2025-05-07T20:33:19.3080934Z contiguous=True, 2025-05-07T20:33:19.3081016Z compiled=True, 2025-05-07T20:33:19.3081091Z ) 2025-05-07T20:33:19.3081303Z self = 2025-05-07T20:33:19.3081466Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.3081471Z 2025-05-07T20:33:19.3081547Z @given( 2025-05-07T20:33:19.3081663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3081758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3081917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3082028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3082143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3082215Z ) 2025-05-07T20:33:19.3082458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3082551Z def test_silu_mul_quant( 2025-05-07T20:33:19.3082622Z self, 2025-05-07T20:33:19.3082696Z T: int, 2025-05-07T20:33:19.3082773Z D: int, 2025-05-07T20:33:19.3082867Z scale_ub: Optional[float], 2025-05-07T20:33:19.3082953Z contiguous: bool, 2025-05-07T20:33:19.3083037Z compiled: bool, 2025-05-07T20:33:19.3083112Z ) -> None: 2025-05-07T20:33:19.3083205Z torch.manual_seed(2025) 2025-05-07T20:33:19.3083280Z 2025-05-07T20:33:19.3083440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3083511Z 2025-05-07T20:33:19.3083605Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3083724Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3085470Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
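Note: the free-memory figure shrinks as the session proceeds (26.44 MiB free at the T=4096 failures above, 4.44 MiB here), and by this point even a 20.00 MiB temporary for torch.clamp fails, which suggests tensors from earlier Hypothesis examples are still alive. A hedged sketch of reclaiming memory between examples (where to call it is illustrative):

    # Sketch: release CUDA memory that earlier examples left behind.
    import gc
    import torch

    def reclaim_cuda() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

Calling this at the end of test_silu_mul_quant, or from a per-example teardown, would bound how much one example's allocations can crowd out the next.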
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3085478Z 2025-05-07T20:33:19.3085592Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.3085597Z 2025-05-07T20:33:19.3085699Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3085916Z self=, 2025-05-07T20:33:19.3085991Z T=128, 2025-05-07T20:33:19.3086068Z D=7168, 2025-05-07T20:33:19.3086147Z scale_ub=None, 2025-05-07T20:33:19.3086228Z contiguous=True, 2025-05-07T20:33:19.3086363Z compiled=True, 2025-05-07T20:33:19.3086437Z ) 2025-05-07T20:33:19.3086654Z self = 2025-05-07T20:33:19.3086818Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.3086822Z 2025-05-07T20:33:19.3086897Z @given( 2025-05-07T20:33:19.3087015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3087112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3087223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3087342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3087451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3087525Z ) 2025-05-07T20:33:19.3087765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3087858Z def test_silu_mul_quant( 2025-05-07T20:33:19.3087934Z self, 2025-05-07T20:33:19.3088013Z T: int, 2025-05-07T20:33:19.3088089Z D: int, 2025-05-07T20:33:19.3088263Z scale_ub: Optional[float], 2025-05-07T20:33:19.3088352Z contiguous: bool, 2025-05-07T20:33:19.3088435Z compiled: bool, 2025-05-07T20:33:19.3088512Z ) -> None: 2025-05-07T20:33:19.3088602Z torch.manual_seed(2025) 2025-05-07T20:33:19.3088673Z 2025-05-07T20:33:19.3088836Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3090572Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3090619Z 2025-05-07T20:33:19.3090737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3090867Z =============================== warnings summary =============================== 2025-05-07T20:33:19.3091170Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.3091472Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.3091763Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.3092631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:19.3092861Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:19.3092868Z 2025-05-07T20:33:19.3093080Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:19.3093242Z ================= 1 failed, 1 deselected, 3 warnings in 12.16s ================= 2025-05-07T20:33:20.9496899Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:21.0130949Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:21.0131584Z 2025-05-07T20:33:23.0149338Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:25.1752615Z ============================= test session starts ============================== 2025-05-07T20:33:25.1753553Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:25.1754112Z cachedir: .pytest_cache 2025-05-07T20:33:25.1754683Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:25.1755418Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:25.1755829Z plugins: hypothesis-6.131.14 2025-05-07T20:33:26.7362138Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:26.8323177Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:26.8323578Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:26.8323789Z 2025-05-07T20:33:28.9544690Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.9545686Z self=, 2025-05-07T20:33:28.9546121Z T=1, 2025-05-07T20:33:28.9546603Z D=5120, 2025-05-07T20:33:28.9546873Z scale_ub=None, 2025-05-07T20:33:28.9547090Z contiguous=True, 2025-05-07T20:33:28.9547347Z compiled=True, 2025-05-07T20:33:28.9547622Z ) 2025-05-07T20:33:28.9547941Z self = 2025-05-07T20:33:28.9548423Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.9548690Z 2025-05-07T20:33:28.9548769Z @given( 2025-05-07T20:33:28.9549006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.9549314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.9549619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.9549949Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.9550369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.9550657Z ) 2025-05-07T20:33:28.9551006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.9551455Z def test_silu_mul_quant( 2025-05-07T20:33:28.9551702Z self, 2025-05-07T20:33:28.9551894Z T: int, 2025-05-07T20:33:28.9552084Z D: int, 2025-05-07T20:33:28.9552299Z scale_ub: Optional[float], 2025-05-07T20:33:28.9552565Z contiguous: bool, 2025-05-07T20:33:28.9552795Z compiled: bool, 2025-05-07T20:33:28.9553027Z ) -> None: 2025-05-07T20:33:28.9553240Z torch.manual_seed(2025) 2025-05-07T20:33:28.9553478Z 2025-05-07T20:33:28.9553740Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.9554085Z 2025-05-07T20:33:28.9554277Z x_sign = torch.sign(x) 2025-05-07T20:33:28.9554560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:28.9554870Z x = x_sign * x_clamp 2025-05-07T20:33:28.9555112Z x0 = x[:, :D] 2025-05-07T20:33:28.9555318Z x1 = x[:, D:] 2025-05-07T20:33:28.9555527Z 2025-05-07T20:33:28.9555714Z if contiguous: 2025-05-07T20:33:28.9555941Z x0 = x0.contiguous() 2025-05-07T20:33:28.9556200Z x1 = x1.contiguous() 2025-05-07T20:33:28.9556435Z 2025-05-07T20:33:28.9556617Z if scale_ub is not None: 2025-05-07T20:33:28.9556888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.9557218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.9557516Z ) 2025-05-07T20:33:28.9557708Z else: 2025-05-07T20:33:28.9557916Z scale_ub_tensor = None 2025-05-07T20:33:28.9558168Z 2025-05-07T20:33:28.9558390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.9558698Z op = silu_mul_quant 2025-05-07T20:33:28.9558945Z if compiled: 2025-05-07T20:33:28.9559190Z op = torch.compile(op) 2025-05-07T20:33:28.9559486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.9559757Z 2025-05-07T20:33:28.9560034Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.9560325Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.9560611Z 2025-05-07T20:33:28.9560840Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.9561174Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.9561460Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.9561765Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.9562126Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.9562431Z 2025-05-07T20:33:28.9562635Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.9562827Z 2025-05-07T20:33:28.9562927Z moe/activation_test.py:126: 2025-05-07T20:33:28.9563225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9563557Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.9563880Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.9564711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.9565491Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.9566035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.9566703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.9567389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.9568102Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.9568879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.9569586Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.9570199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.9570706Z fn() 2025-05-07T20:33:28.9571230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.9571832Z self.fn.run( 2025-05-07T20:33:28.9572315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.9572846Z kernel = self.compile( 2025-05-07T20:33:28.9573388Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.9574055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.9574439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9574677Z 2025-05-07T20:33:28.9574881Z self = 2025-05-07T20:33:28.9575962Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.9577337Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f5d3a700>} 2025-05-07T20:33:28.9578673Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.9579672Z context = 2025-05-07T20:33:28.9579969Z 2025-05-07T20:33:28.9580138Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.9580703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.9581166Z module_map=module_map) 2025-05-07T20:33:28.9581519Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.9581878Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.9582144Z E ^ 2025-05-07T20:33:28.9582592Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.9583050Z 2025-05-07T20:33:28.9583473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.9583978Z 2025-05-07T20:33:28.9584079Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.9584490Z self=, 2025-05-07T20:33:28.9584899Z T=2048, 2025-05-07T20:33:28.9585084Z D=5120, 2025-05-07T20:33:28.9585274Z scale_ub=1200.0, 2025-05-07T20:33:28.9585495Z contiguous=True, 2025-05-07T20:33:28.9585719Z compiled=False, 2025-05-07T20:33:28.9586008Z ) 2025-05-07T20:33:28.9586323Z self = 2025-05-07T20:33:28.9586811Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.9587075Z 2025-05-07T20:33:28.9587159Z @given( 2025-05-07T20:33:28.9587392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.9587751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.9588051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.9588374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.9588688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.9588974Z ) 2025-05-07T20:33:28.9589363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.9589806Z def test_silu_mul_quant( 2025-05-07T20:33:28.9590046Z self, 2025-05-07T20:33:28.9590241Z T: int, 2025-05-07T20:33:28.9590431Z D: int, 2025-05-07T20:33:28.9590652Z scale_ub: Optional[float], 2025-05-07T20:33:28.9590919Z contiguous: bool, 2025-05-07T20:33:28.9591152Z compiled: bool, 2025-05-07T20:33:28.9591370Z ) -> None: 2025-05-07T20:33:28.9591588Z torch.manual_seed(2025) 2025-05-07T20:33:28.9591831Z 2025-05-07T20:33:28.9592096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.9592443Z 2025-05-07T20:33:28.9592638Z x_sign = torch.sign(x) 2025-05-07T20:33:28.9592923Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.9593236Z x = x_sign * x_clamp 2025-05-07T20:33:28.9593478Z x0 = x[:, :D] 
2025-05-07T20:33:28.9593688Z x1 = x[:, D:] 2025-05-07T20:33:28.9593904Z 2025-05-07T20:33:28.9594087Z if contiguous: 2025-05-07T20:33:28.9594309Z x0 = x0.contiguous() 2025-05-07T20:33:28.9594568Z x1 = x1.contiguous() 2025-05-07T20:33:28.9594810Z 2025-05-07T20:33:28.9594996Z if scale_ub is not None: 2025-05-07T20:33:28.9595270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.9595605Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.9595911Z ) 2025-05-07T20:33:28.9596107Z else: 2025-05-07T20:33:28.9596311Z scale_ub_tensor = None 2025-05-07T20:33:28.9596554Z 2025-05-07T20:33:28.9596781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.9597085Z op = silu_mul_quant 2025-05-07T20:33:28.9597340Z if compiled: 2025-05-07T20:33:28.9597578Z op = torch.compile(op) 2025-05-07T20:33:28.9597870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.9598148Z 2025-05-07T20:33:28.9598331Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.9598497Z 2025-05-07T20:33:28.9598596Z moe/activation_test.py:117: 2025-05-07T20:33:28.9598940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9599265Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.9599547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.9600254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.9600931Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.9601459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.9602132Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.9602786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.9603304Z kernel = self.compile( 2025-05-07T20:33:28.9603853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.9604539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.9604967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9605188Z 2025-05-07T20:33:28.9605389Z self = 2025-05-07T20:33:28.9606447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.9607798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f5bf2020>} 2025-05-07T20:33:28.9609202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.9610214Z context = 2025-05-07T20:33:28.9610496Z 2025-05-07T20:33:28.9610660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.9611174Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.9611634Z module_map=module_map) 2025-05-07T20:33:28.9611992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.9612343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.9612605Z E ^ 2025-05-07T20:33:28.9613068Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.9613516Z 2025-05-07T20:33:28.9613942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.6170608Z 2025-05-07T20:33:29.6171020Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.6171759Z self=, 2025-05-07T20:33:29.6172405Z T=2048, 2025-05-07T20:33:29.6172696Z D=5120, 2025-05-07T20:33:29.6172987Z scale_ub=1200.0, 2025-05-07T20:33:29.6173317Z contiguous=True, 2025-05-07T20:33:29.6173659Z compiled=True, 2025-05-07T20:33:29.6173975Z ) 2025-05-07T20:33:29.6174462Z self = 2025-05-07T20:33:29.6175301Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.6175710Z 2025-05-07T20:33:29.6175820Z @given( 2025-05-07T20:33:29.6176186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.6176702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.6177192Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.6178042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.6178569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.6179078Z ) 2025-05-07T20:33:29.6179627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.6180367Z def test_silu_mul_quant( 2025-05-07T20:33:29.6180753Z self, 2025-05-07T20:33:29.6181050Z T: int, 2025-05-07T20:33:29.6181363Z D: int, 2025-05-07T20:33:29.6181694Z scale_ub: Optional[float], 2025-05-07T20:33:29.6182114Z contiguous: bool, 2025-05-07T20:33:29.6182499Z compiled: bool, 2025-05-07T20:33:29.6182862Z ) -> None: 2025-05-07T20:33:29.6183192Z torch.manual_seed(2025) 2025-05-07T20:33:29.6183570Z 2025-05-07T20:33:29.6183997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.6184549Z 2025-05-07T20:33:29.6184842Z x_sign = torch.sign(x) 2025-05-07T20:33:29.6185299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.6185798Z x = x_sign * x_clamp 2025-05-07T20:33:29.6186392Z x0 = x[:, :D] 2025-05-07T20:33:29.6186723Z x1 = x[:, D:] 2025-05-07T20:33:29.6187050Z 2025-05-07T20:33:29.6187327Z if contiguous: 2025-05-07T20:33:29.6187792Z x0 = x0.contiguous() 2025-05-07T20:33:29.6188197Z x1 = x1.contiguous() 2025-05-07T20:33:29.6188581Z 2025-05-07T20:33:29.6188886Z if scale_ub is not None: 2025-05-07T20:33:29.6189324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.6189848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.6190347Z ) 2025-05-07T20:33:29.6190652Z else: 2025-05-07T20:33:29.6190968Z scale_ub_tensor = None 2025-05-07T20:33:29.6191325Z 2025-05-07T20:33:29.6191871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.6192334Z op = silu_mul_quant 2025-05-07T20:33:29.6192711Z if compiled: 2025-05-07T20:33:29.6193093Z op = torch.compile(op) 2025-05-07T20:33:29.6193543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.6193966Z 2025-05-07T20:33:29.6194273Z y_fp8, y_scale = fn() 2025-05-07T20:33:29.6194712Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:29.6195188Z 2025-05-07T20:33:29.6195578Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.6196145Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:29.6196633Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:29.6197127Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:29.6197681Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:29.6198183Z 2025-05-07T20:33:29.6198492Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:29.6198801Z 2025-05-07T20:33:29.6198961Z moe/activation_test.py:126: 2025-05-07T20:33:29.6199436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6200005Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:29.6200527Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:29.6201925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:29.6203243Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:29.6204180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.6205281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.6206430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:29.6207615Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:29.6208927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:29.6209993Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:29.6210998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:29.6211857Z fn() 2025-05-07T20:33:29.6212691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:29.6222720Z self.fn.run( 2025-05-07T20:33:29.6223597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.6224533Z kernel = self.compile( 2025-05-07T20:33:29.6225469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.6226631Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.6227313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6227891Z 2025-05-07T20:33:29.6228297Z self = 2025-05-07T20:33:29.6230049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.6232487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f4acf560>} 2025-05-07T20:33:29.6234930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.6236840Z context = 2025-05-07T20:33:29.6237354Z 2025-05-07T20:33:29.6237624Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.6238497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.6239303Z module_map=module_map) 2025-05-07T20:33:29.6239898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.6240702Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:29.6241145Z E ^ 2025-05-07T20:33:29.6241940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.6242746Z 2025-05-07T20:33:29.6243482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.6244414Z 2025-05-07T20:33:29.6244573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.6245284Z self=, 2025-05-07T20:33:29.6245978Z T=16384, 2025-05-07T20:33:29.6246279Z D=7168, 2025-05-07T20:33:29.6246590Z scale_ub=1200.0, 2025-05-07T20:33:29.6246945Z contiguous=False, 2025-05-07T20:33:29.6247305Z compiled=False, 2025-05-07T20:33:29.6247633Z ) 2025-05-07T20:33:29.6248159Z self = 2025-05-07T20:33:29.6249010Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:29.6249501Z 2025-05-07T20:33:29.6249631Z @given( 2025-05-07T20:33:29.6249996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.6250512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.6251023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.6251577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.6252134Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.6252604Z ) 2025-05-07T20:33:29.6253338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.6254113Z def test_silu_mul_quant( 2025-05-07T20:33:29.6254498Z self, 2025-05-07T20:33:29.6254813Z T: int, 2025-05-07T20:33:29.6255127Z D: int, 2025-05-07T20:33:29.6255468Z scale_ub: Optional[float], 2025-05-07T20:33:29.6255914Z contiguous: bool, 2025-05-07T20:33:29.6256307Z compiled: bool, 2025-05-07T20:33:29.6256658Z ) -> None: 2025-05-07T20:33:29.6257006Z torch.manual_seed(2025) 2025-05-07T20:33:29.6257406Z 2025-05-07T20:33:29.6257838Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.6258382Z 2025-05-07T20:33:29.6258682Z x_sign = torch.sign(x) 2025-05-07T20:33:29.6259054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.6259461Z x = x_sign * x_clamp 2025-05-07T20:33:29.6259772Z x0 = x[:, :D] 2025-05-07T20:33:29.6260081Z x1 = x[:, D:] 2025-05-07T20:33:29.6260364Z 2025-05-07T20:33:29.6260863Z if contiguous: 2025-05-07T20:33:29.6261201Z x0 = x0.contiguous() 2025-05-07T20:33:29.6261562Z x1 = x1.contiguous() 2025-05-07T20:33:29.6261910Z 2025-05-07T20:33:29.6262178Z if scale_ub is not None: 2025-05-07T20:33:29.6262564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.6263085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.6263562Z ) 2025-05-07T20:33:29.6263835Z else: 2025-05-07T20:33:29.6264141Z scale_ub_tensor = None 2025-05-07T20:33:29.6264532Z 2025-05-07T20:33:29.6264859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.6265331Z op = silu_mul_quant 2025-05-07T20:33:29.6265889Z if compiled: 2025-05-07T20:33:29.6266286Z op = torch.compile(op) 2025-05-07T20:33:29.6266774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.6267235Z 2025-05-07T20:33:29.6267670Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.6267963Z 2025-05-07T20:33:29.6268124Z moe/activation_test.py:117: 2025-05-07T20:33:29.6268611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6269161Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.6269614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.6270827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:29.6272054Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.6272977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.6274180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.6275359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.6276288Z kernel = self.compile( 2025-05-07T20:33:29.6277226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.6278300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.6278894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6279233Z 2025-05-07T20:33:29.6279548Z self = 2025-05-07T20:33:29.6281392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.6283910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f4d787c0>} 2025-05-07T20:33:29.6286433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.6288278Z context = 2025-05-07T20:33:29.6288828Z 2025-05-07T20:33:29.6289107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.6290007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.6290814Z module_map=module_map) 2025-05-07T20:33:29.6291416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.6291994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.6292425Z E ^ 2025-05-07T20:33:29.6293221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.6294038Z 2025-05-07T20:33:29.6294915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.3244265Z 2025-05-07T20:33:30.3244749Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.3245194Z self=, 2025-05-07T20:33:30.3245606Z T=1, 2025-05-07T20:33:30.3245789Z D=7168, 2025-05-07T20:33:30.3245979Z scale_ub=None, 2025-05-07T20:33:30.3246185Z contiguous=True, 2025-05-07T20:33:30.3246407Z compiled=True, 2025-05-07T20:33:30.3246619Z ) 2025-05-07T20:33:30.3246936Z self = 2025-05-07T20:33:30.3247417Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.3247996Z 2025-05-07T20:33:30.3248073Z @given( 2025-05-07T20:33:30.3248308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.3248625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.3248944Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.3249277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.3249594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.3249879Z ) 2025-05-07T20:33:30.3250231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.3250673Z def test_silu_mul_quant( 2025-05-07T20:33:30.3250910Z self, 2025-05-07T20:33:30.3251106Z T: int, 2025-05-07T20:33:30.3251291Z D: int, 2025-05-07T20:33:30.3251508Z scale_ub: Optional[float], 2025-05-07T20:33:30.3251775Z contiguous: bool, 2025-05-07T20:33:30.3252009Z compiled: bool, 2025-05-07T20:33:30.3252229Z ) -> None: 2025-05-07T20:33:30.3252439Z torch.manual_seed(2025) 2025-05-07T20:33:30.3252675Z 2025-05-07T20:33:30.3252937Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.3253272Z 2025-05-07T20:33:30.3253463Z x_sign = torch.sign(x) 2025-05-07T20:33:30.3253741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.3254043Z x = x_sign * x_clamp 2025-05-07T20:33:30.3254282Z x0 = x[:, :D] 2025-05-07T20:33:30.3254485Z x1 = x[:, D:] 2025-05-07T20:33:30.3254687Z 2025-05-07T20:33:30.3254863Z if contiguous: 2025-05-07T20:33:30.3255085Z x0 = x0.contiguous() 2025-05-07T20:33:30.3255336Z x1 = x1.contiguous() 2025-05-07T20:33:30.3255569Z 2025-05-07T20:33:30.3255745Z if scale_ub is not None: 2025-05-07T20:33:30.3256007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.3256331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.3256644Z ) 2025-05-07T20:33:30.3256830Z else: 2025-05-07T20:33:30.3257036Z scale_ub_tensor = None 2025-05-07T20:33:30.3257280Z 2025-05-07T20:33:30.3257605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3257922Z op = silu_mul_quant 2025-05-07T20:33:30.3258165Z if compiled: 2025-05-07T20:33:30.3258436Z op = torch.compile(op) 2025-05-07T20:33:30.3258729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.3258991Z 2025-05-07T20:33:30.3259175Z y_fp8, y_scale = fn() 2025-05-07T20:33:30.3259456Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:30.3259731Z 2025-05-07T20:33:30.3259959Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3260283Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:30.3260563Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:30.3260866Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:30.3261226Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3261523Z 2025-05-07T20:33:30.3261726Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:30.3262014Z 2025-05-07T20:33:30.3262184Z moe/activation_test.py:126: 2025-05-07T20:33:30.3262476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3262797Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:30.3263116Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3263916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:30.3264646Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:30.3265195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.3265884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.3266613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:30.3267319Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:30.3268127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:30.3268774Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:30.3269365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:30.3269862Z fn() 2025-05-07T20:33:30.3270371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:30.3270958Z self.fn.run( 2025-05-07T20:33:30.3271412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.3271941Z kernel = self.compile( 2025-05-07T20:33:30.3272490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.3273135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.3273525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3273747Z 2025-05-07T20:33:30.3273952Z self = 2025-05-07T20:33:30.3275033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.3276407Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f4baa340>} 2025-05-07T20:33:30.3277873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.3278935Z context = 2025-05-07T20:33:30.3279215Z 2025-05-07T20:33:30.3279377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.3279889Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.3280357Z module_map=module_map) 2025-05-07T20:33:30.3280712Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.3281071Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:30.3281338Z E ^ 2025-05-07T20:33:30.3281806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False); same test source as above, but with compiled=False the failure is at the forward call itself:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
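Both failure paths share one root cause: Triton's NVIDIA backend refuses to lower the fp8e4nv element type (torch.float8_e4m3fn) during ast_to_ttir. As of recent Triton releases, fp8e4nv kernels require compute capability 8.9 or newer (Ada/Hopper); the error text offering only 'fp8e4b15' and 'fp8e5' suggests this runner's GPU predates sm_89. A minimal probe for this, as a sketch (the helper name supports_triton_fp8e4nv is illustrative and not part of the test file):

    import torch

    def supports_triton_fp8e4nv() -> bool:
        # Triton only lowers fp8e4nv (float8_e4m3fn) on NVIDIA GPUs with
        # compute capability >= 8.9; older parts expose only fp8e4b15/fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)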
All remaining drawn examples fail with the identical CompilationError; only the parameters differ. Examples with compiled=False fail at the forward call fn() (moe/activation_test.py:117 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) while compiling _fbgemm_silu_mul_quant, and examples with compiled=True fail at the reference call ref_fn() (moe/activation_test.py:126 -> fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370) while compiling _kernel_quantize_fp8_row:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError while compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError while compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError while compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row

In every case the root error reported at triton/compiler/compiler.py:100 is:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
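Since every drawn example dies in kernel compilation rather than in a numerical check, the suite would report this more usefully as a skip on unsupported hardware. A sketch of such a guard (the decorator name skip_unless_fp8e4nv is hypothetical, not the FBGEMM authors' fix), reusing the capability probe above:

    import pytest
    import torch

    skip_unless_fp8e4nv = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv requires an sm_89+ GPU; this device only supports fp8e4b15/fp8e5",
    )

    # Applied to the Hypothesis test, e.g.:
    #
    # @skip_unless_fp8e4nv
    # @given(...)
    # @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    # def test_silu_mul_quant(...): ...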
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.7664481Z 
2025-05-07T20:33:32.7664908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:32.7665412Z 
2025-05-07T20:33:32.7665518Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:32.7682478Z [test body identical to the T=4096 example above; fails in ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row]
2025-05-07T20:33:32.7701342Z E triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
2025-05-07T20:33:32.7708921Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.7709863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
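Every example in this run dies on the same root cause: Triton's fp8e4nv dtype (the e4m3 float8 encoding) is only lowered on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the A10G on this g5 runner reports capability 8.6, where Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError lists. A minimal capability guard the test could check before exercising the fp8 kernels is sketched below; supports_fp8e4nv is a hypothetical helper, not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8 e4m3) only on SM 8.9+ (Ada/Hopper).
        # Ampere parts such as the A10G report (8, 6) and expose only the
        # fp8e4b15 / fp8e5 encodings, matching the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)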
2025-05-07T20:33:32.7898197Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:32.7899999Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:32.7901565Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:32.7902540Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:32.7903690Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:33.1918697Z 
2025-05-07T20:33:33.1919009Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:33.1926665Z [test body identical to the T=4096 example above]
2025-05-07T20:33:33.1934034Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:33.1934306Z moe/activation_test.py:117: 
2025-05-07T20:33:33.1934601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:33.1935007Z moe/activation_test.py:115: in fn
2025-05-07T20:33:33.1935285Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.1935857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:33.1936413Z     return fn(*args, **kwargs)
2025-05-07T20:33:33.1937054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:33.1937723Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:33.1938262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.1938922Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.1939582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.1940396Z     kernel = self.compile(
2025-05-07T20:33:33.1940949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.1941590Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.1949192Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.1950244Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.1951132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
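The recompile_limit warning above is a side effect of the Hypothesis sweep rather than an independent bug: contiguous examples hand silu_mul_quant an x0 with row stride 5120, non-contiguous ones a view with row stride 10240, and each flip installs a new Dynamo guard set until the limit of 8 is hit, after which new variants run eagerly. A hedged workaround for a property test, assuming the goal is to keep every example compiled, is to clear the compile caches per example:

    import torch

    # Assumed placement: first statement of the Hypothesis test body, so each
    # example starts from an empty Dynamo cache instead of accumulating
    # stride-specific guard sets until config.recompile_limit (8) is reached.
    torch._dynamo.reset()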
2025-05-07T20:33:33.1951734Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:33.1968378Z [test body identical to the T=4096 example above; fails in ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row]
2025-05-07T20:33:33.1987161Z E triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
2025-05-07T20:33:33.1988307Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.1989193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.3387942Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:33.3403007Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.3416769Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.3417833Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.3418716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
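The compiled=False examples make clear that torch.compile is not the culprit: _fbgemm_silu_mul_quant[grid](...) is launched through Triton's own JIT, so the eager path compiles (and fails) the same way. A minimal eager repro, assuming the import path shown in the traceback and any pre-Ada (SM < 8.9) GPU:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)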
2025-05-07T20:33:33.3419322Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:33.3433632Z [test body identical to the T=4096 example above; fails in fn() at moe/activation_test.py:117 -> torch._dynamo eval_frame -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.3448782Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.3449852Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.3450722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
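For reference, the quantity ref_fn computes is plain SiLU-gated multiplication, y = x0 * sigmoid(x0) * x1, followed by row-wise fp8 quantization. The sketch below is a pure-PyTorch stand-in for triton_quantize_fp8_row under assumed semantics (per-row scale = row max / fp8 max, optionally capped by scale_ub, with dequantization as y_fp8.to(torch.float32) * y_scale[:, None], as in the test); it is not the fbgemm_gpu implementation:

    from typing import Optional, Tuple
    import torch

    def rowwise_fp8_quant_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # guard divide-by-zero
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale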
2025-05-07T20:33:33.3451330Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:33.5033925Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.5048109Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.5049184Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.5050095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
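With a guard like supports_fp8e4nv() from the sketch above, this whole family of failures collapses into a single clean skip on pre-Ada GPUs. A hedged example of wiring it into a unittest-style test such as this one (the skip composes with Hypothesis's @given because it fires before any example is drawn):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:  # hypothetical helper, as sketched earlier
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(
            not supports_fp8e4nv(),
            "Triton fp8e4nv needs SM 8.9+; this GPU exposes only fp8e4b15/fp8e5",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as in the log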
2025-05-07T20:33:33.5050781Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:33.5065037Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.5078784Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.5079854Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.5080724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.5081324Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:33.5095529Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.5109415Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.5110518Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.5111405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.6645923Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:33.6660630Z [test body identical to the T=4096 example above; fails in fn() at moe/activation_test.py:117 -> torch._dynamo eval_frame -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.6675608Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.6676676Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.6677557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.6677135Z 2025-05-07T20:33:33.6677557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.6678068Z 2025-05-07T20:33:33.6678167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.6678572Z self=, 2025-05-07T20:33:33.6678955Z T=1, 2025-05-07T20:33:33.6679139Z D=7168, 2025-05-07T20:33:33.6679329Z scale_ub=1200.0, 2025-05-07T20:33:33.6679542Z contiguous=False, 2025-05-07T20:33:33.6679759Z compiled=True, 2025-05-07T20:33:33.6679962Z ) 2025-05-07T20:33:33.6680270Z self = 2025-05-07T20:33:33.6680746Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.6681006Z 2025-05-07T20:33:33.6681092Z @given( 2025-05-07T20:33:33.6681369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.6681670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.6681975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.6682294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.6682612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.6682892Z ) 2025-05-07T20:33:33.6683245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.6683683Z def test_silu_mul_quant( 2025-05-07T20:33:33.6683927Z self, 2025-05-07T20:33:33.6684122Z T: int, 2025-05-07T20:33:33.6684316Z D: int, 2025-05-07T20:33:33.6684533Z scale_ub: Optional[float], 2025-05-07T20:33:33.6684805Z contiguous: bool, 2025-05-07T20:33:33.6685042Z compiled: bool, 2025-05-07T20:33:33.6685262Z ) -> None: 2025-05-07T20:33:33.6685473Z torch.manual_seed(2025) 2025-05-07T20:33:33.6685706Z 2025-05-07T20:33:33.6685985Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.6686375Z 2025-05-07T20:33:33.6686607Z x_sign = torch.sign(x) 2025-05-07T20:33:33.6686893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.6687201Z x = x_sign * x_clamp 2025-05-07T20:33:33.6687440Z x0 = x[:, :D] 2025-05-07T20:33:33.6687647Z x1 = x[:, D:] 2025-05-07T20:33:33.6687856Z 2025-05-07T20:33:33.6688039Z if contiguous: 2025-05-07T20:33:33.6688255Z x0 = x0.contiguous() 2025-05-07T20:33:33.6688502Z x1 = x1.contiguous() 2025-05-07T20:33:33.6688755Z 2025-05-07T20:33:33.6688941Z if scale_ub is not None: 2025-05-07T20:33:33.6689197Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.6689524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.6689895Z ) 2025-05-07T20:33:33.6690079Z else: 2025-05-07T20:33:33.6690283Z scale_ub_tensor = None 2025-05-07T20:33:33.6690536Z 2025-05-07T20:33:33.6690759Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.6691069Z op = silu_mul_quant 2025-05-07T20:33:33.6691317Z if compiled: 2025-05-07T20:33:33.6691554Z op = torch.compile(op) 2025-05-07T20:33:33.6691847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.6692115Z 2025-05-07T20:33:33.6692297Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.6692463Z 2025-05-07T20:33:33.6692556Z moe/activation_test.py:117: 2025-05-07T20:33:33.6692845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.6693171Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.6693442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.6703932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.6704606Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.6705295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.6705999Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.6706559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.6707249Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.6707990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.6708532Z kernel = self.compile( 2025-05-07T20:33:33.6709090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.6709740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.6710147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.6710387Z 2025-05-07T20:33:33.6710679Z self = 2025-05-07T20:33:33.6711780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.6713143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca482c00>} 2025-05-07T20:33:33.6714470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.6715496Z context = 2025-05-07T20:33:33.6715784Z 2025-05-07T20:33:33.6715962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.6716530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.6717035Z module_map=module_map) 2025-05-07T20:33:33.6717405Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.6717761Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.6718027Z E ^ 2025-05-07T20:33:33.6718495Z E ValueError("type fp8e4nv not supported in this architecture. 
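Every one of these failures bottoms out in the same check: Triton refuses to lower a kernel that touches the `fp8e4nv` (float8 e4m3) type while building TTIR on this GPU, which is why the error points at the kernel signature (`at 1:0`) rather than any line inside it. A minimal sketch that should reproduce the same `CompilationError` outside FBGEMM; the kernel name is ours, and it assumes a Triton build that accepts `torch.float8_e4m3fn` tensors as kernel arguments and a GPU below compute capability 8.9:

```python
# Hypothetical repro, not FBGEMM code: on a pre-SM-8.9 GPU (e.g. the A10G
# behind linux.g5.4xlarge) Triton rejects fp8e4nv while lowering the AST
# to TTIR, before the kernel ever launches.
import torch
import triton
import triton.language as tl


@triton.jit
def fp8_cast_kernel(x_ptr, y_ptr):
    x = tl.load(x_ptr)
    # The fp8 output pointer and this cast trip the architecture check
    # that raises ValueError("type fp8e4nv not supported ...").
    tl.store(y_ptr, x.to(tl.float8e4nv))


x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
fp8_cast_kernel[(1,)](x, y)  # raises triton.compiler.errors.CompilationError
```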
2025-05-07T20:33:33.8754680Z 
2025-05-07T20:33:33.8755213Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:33.8755800Z     self=<...>,
2025-05-07T20:33:33.8756221Z     T=1,
2025-05-07T20:33:33.8756466Z     D=7168,
2025-05-07T20:33:33.8756671Z     scale_ub=None,
2025-05-07T20:33:33.8756904Z     contiguous=False,
2025-05-07T20:33:33.8757131Z     compiled=True,
2025-05-07T20:33:33.8757339Z )
2025-05-07T20:33:33.8757661Z self = <...>
2025-05-07T20:33:33.8758145Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:33.8758398Z 
[... @given/@settings stanza and test body identical to the example above; elided ...]
2025-05-07T20:33:33.8769684Z         y_fp8, y_scale = fn()
2025-05-07T20:33:33.8769960Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:33.8770310Z 
2025-05-07T20:33:33.8770595Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:33.8770916Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:33.8771196Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:33.8771500Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:33.8771847Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.8772142Z 
2025-05-07T20:33:33.8772337Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:33.8772526Z 
2025-05-07T20:33:33.8772627Z moe/activation_test.py:126: 
2025-05-07T20:33:33.8772908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.8773230Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:33.8773594Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.8774392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:33.8775124Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:33.8775668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.8776333Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.8777008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:33.8777710Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:33.8778428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:33.8779054Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:33.8779641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:33.8780146Z     fn()
2025-05-07T20:33:33.8780653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:33.8781232Z     self.fn.run(
2025-05-07T20:33:33.8781691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.8782202Z     kernel = self.compile(
2025-05-07T20:33:33.8782744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.8783396Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.8783776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.8783996Z 
2025-05-07T20:33:33.8784203Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:33.8785309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:33.8786656Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca6f4180>}
2025-05-07T20:33:33.8788068Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:33.8789114Z context = <...>
2025-05-07T20:33:33.8789393Z 
2025-05-07T20:33:33.8789583Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.8790112Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.8790579Z                            module_map=module_map)
2025-05-07T20:33:33.8791056Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.8791403Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:33.8791660Z E       ^
2025-05-07T20:33:33.8792126Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.8792571Z 
2025-05-07T20:33:33.8792995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
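Note the two distinct failure paths: with `scale_ub=1200.0` the test dies inside `fn()` compiling `_fbgemm_silu_mul_quant`, while the example above gets past `fn()` and dies in `ref_fn()` compiling `_kernel_quantize_fp8_row` via `triton_quantize_fp8_row`. Both are the same hardware limitation: this job ran on `linux.g5.4xlarge.nvidia.gpu`, whose NVIDIA A10G reports compute capability (8, 6), and Triton's `fp8e4nv` (float8 e4m3) lowering generally expects SM 8.9+ (Ada/Hopper); on this GPU only `fp8e5` and `fp8e4b15` are available, exactly as the ValueError says. A minimal sketch, assuming pytest and helper names of our own choosing (FBGEMM's actual gating may differ), of a capability gate that would turn these hard failures into skips:

```python
# Sketch of a hardware gate for fp8e4nv-dependent tests; the helper name,
# the (8, 9) threshold, and the marker are assumptions, not FBGEMM API.
import pytest
import torch


def supports_fp8e4nv() -> bool:
    """True if this GPU should compile Triton kernels that use fp8e4nv."""
    if not torch.cuda.is_available():
        return False
    # The A10G on this runner reports (8, 6), so make_ir aborts before any
    # kernel launches; Ada/Hopper parts report (8, 9) or (9, 0).
    return torch.cuda.get_device_capability() >= (8, 9)


# Stacked on top of the @given/@settings decorators shown in the traceback,
# this reports a clean skip instead of a CompilationError on pre-Ada GPUs.
requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv needs SM 8.9+; this GPU only has fp8e5/fp8e4b15",
)
```

With `@requires_fp8e4nv` applied to `test_silu_mul_quant`, the remaining examples below would be reported as skips rather than repeated failures.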
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.8792571Z 2025-05-07T20:33:33.8792995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.8793497Z 2025-05-07T20:33:33.8793604Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.8794000Z self=, 2025-05-07T20:33:33.8794391Z T=1, 2025-05-07T20:33:33.8794614Z D=5120, 2025-05-07T20:33:33.8794794Z scale_ub=1200.0, 2025-05-07T20:33:33.8795010Z contiguous=False, 2025-05-07T20:33:33.8795227Z compiled=True, 2025-05-07T20:33:33.8795424Z ) 2025-05-07T20:33:33.8795734Z self = 2025-05-07T20:33:33.8796204Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.8796467Z 2025-05-07T20:33:33.8796546Z @given( 2025-05-07T20:33:33.8796764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.8797062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.8797355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.8797667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.8797987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.8798259Z ) 2025-05-07T20:33:33.8798593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.8799030Z def test_silu_mul_quant( 2025-05-07T20:33:33.8799265Z self, 2025-05-07T20:33:33.8799448Z T: int, 2025-05-07T20:33:33.8799650Z D: int, 2025-05-07T20:33:33.8799898Z scale_ub: Optional[float], 2025-05-07T20:33:33.8800165Z contiguous: bool, 2025-05-07T20:33:33.8800393Z compiled: bool, 2025-05-07T20:33:33.8800603Z ) -> None: 2025-05-07T20:33:33.8800804Z torch.manual_seed(2025) 2025-05-07T20:33:33.8801034Z 2025-05-07T20:33:33.8801293Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.8801623Z 2025-05-07T20:33:33.8801800Z x_sign = torch.sign(x) 2025-05-07T20:33:33.8802076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.8802373Z x = x_sign * x_clamp 2025-05-07T20:33:33.8802597Z x0 = x[:, :D] 2025-05-07T20:33:33.8802803Z x1 = x[:, D:] 2025-05-07T20:33:33.8802998Z 2025-05-07T20:33:33.8803167Z if contiguous: 2025-05-07T20:33:33.8803385Z x0 = x0.contiguous() 2025-05-07T20:33:33.8803630Z x1 = x1.contiguous() 2025-05-07T20:33:33.8803855Z 2025-05-07T20:33:33.8804080Z if scale_ub is not None: 2025-05-07T20:33:33.8804347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.8804665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.8804969Z ) 2025-05-07T20:33:33.8805149Z else: 2025-05-07T20:33:33.8805345Z scale_ub_tensor = None 2025-05-07T20:33:33.8805585Z 2025-05-07T20:33:33.8805803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.8806099Z op = silu_mul_quant 2025-05-07T20:33:33.8806333Z if compiled: 2025-05-07T20:33:33.8806566Z op = torch.compile(op) 2025-05-07T20:33:33.8806853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.8807110Z 2025-05-07T20:33:33.8807294Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.8807455Z 2025-05-07T20:33:33.8807552Z moe/activation_test.py:117: 2025-05-07T20:33:33.8807835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.8808152Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.8808515Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.8809077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.8809633Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.8810278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.8810946Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.8811484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.8812149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.8812854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.8813370Z kernel = self.compile( 2025-05-07T20:33:33.8813905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.8814539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.8814921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.8815139Z 2025-05-07T20:33:33.8815343Z self = 2025-05-07T20:33:33.8816393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.8817734Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca6f5300>} 2025-05-07T20:33:33.8819051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.8820103Z context = 2025-05-07T20:33:33.8820390Z 2025-05-07T20:33:33.8820551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.8821062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.8821513Z module_map=module_map) 2025-05-07T20:33:33.8821872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.8822214Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.8822464Z E ^ 2025-05-07T20:33:33.8822916Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.8823373Z 2025-05-07T20:33:33.8823834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.0221761Z 2025-05-07T20:33:34.0222015Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.0222479Z self=, 2025-05-07T20:33:34.0222884Z T=1, 2025-05-07T20:33:34.0223121Z D=5120, 2025-05-07T20:33:34.0223319Z scale_ub=1200.0, 2025-05-07T20:33:34.0223540Z contiguous=False, 2025-05-07T20:33:34.0223774Z compiled=False, 2025-05-07T20:33:34.0223982Z ) 2025-05-07T20:33:34.0224293Z self = 2025-05-07T20:33:34.0224779Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:34.0225046Z 2025-05-07T20:33:34.0225138Z @given( 2025-05-07T20:33:34.0225364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.0225672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.0225981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.0226464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.0226786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.0227073Z ) 2025-05-07T20:33:34.0227486Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.0227935Z def test_silu_mul_quant( 2025-05-07T20:33:34.0228180Z self, 2025-05-07T20:33:34.0228373Z T: int, 2025-05-07T20:33:34.0228563Z D: int, 2025-05-07T20:33:34.0228784Z scale_ub: Optional[float], 2025-05-07T20:33:34.0229054Z contiguous: bool, 2025-05-07T20:33:34.0229289Z compiled: bool, 2025-05-07T20:33:34.0229518Z ) -> None: 2025-05-07T20:33:34.0229764Z torch.manual_seed(2025) 2025-05-07T20:33:34.0230087Z 2025-05-07T20:33:34.0230354Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.0230694Z 2025-05-07T20:33:34.0230887Z x_sign = torch.sign(x) 2025-05-07T20:33:34.0231179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.0231490Z x = x_sign * x_clamp 2025-05-07T20:33:34.0231729Z x0 = x[:, :D] 2025-05-07T20:33:34.0231944Z x1 = x[:, D:] 2025-05-07T20:33:34.0232156Z 2025-05-07T20:33:34.0232346Z if contiguous: 2025-05-07T20:33:34.0232578Z x0 = x0.contiguous() 2025-05-07T20:33:34.0232836Z x1 = x1.contiguous() 2025-05-07T20:33:34.0233083Z 2025-05-07T20:33:34.0233273Z if scale_ub is not None: 2025-05-07T20:33:34.0233548Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.0233883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.0234193Z ) 2025-05-07T20:33:34.0234396Z else: 2025-05-07T20:33:34.0234616Z scale_ub_tensor = None 2025-05-07T20:33:34.0234868Z 2025-05-07T20:33:34.0235105Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.0235421Z op = silu_mul_quant 2025-05-07T20:33:34.0235678Z if compiled: 2025-05-07T20:33:34.0235931Z op = torch.compile(op) 2025-05-07T20:33:34.0236225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0236506Z 2025-05-07T20:33:34.0236697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.0236864Z 2025-05-07T20:33:34.0236963Z moe/activation_test.py:117: 2025-05-07T20:33:34.0237259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0237582Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.0237876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0238560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.0239245Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.0239866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.0240736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.0241403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.0241919Z kernel = self.compile( 2025-05-07T20:33:34.0242463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.0243113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.0243503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0243724Z 2025-05-07T20:33:34.0243928Z self = 2025-05-07T20:33:34.0244996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.0247105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca6f6020>} 2025-05-07T20:33:34.0248471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.0249473Z context = 2025-05-07T20:33:34.0249760Z 2025-05-07T20:33:34.0249924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.0250435Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.0250961Z module_map=module_map) 2025-05-07T20:33:34.0251318Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.0251676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.0251942Z E ^ 2025-05-07T20:33:34.0252400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.0252855Z 2025-05-07T20:33:34.0253272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.0253775Z 2025-05-07T20:33:34.0253875Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.0254289Z self=, 2025-05-07T20:33:34.0254685Z T=16384, 2025-05-07T20:33:34.0254882Z D=5120, 2025-05-07T20:33:34.0255077Z scale_ub=1200.0, 2025-05-07T20:33:34.0255306Z contiguous=False, 2025-05-07T20:33:34.0255537Z compiled=True, 2025-05-07T20:33:34.0255743Z ) 2025-05-07T20:33:34.0256056Z self = 2025-05-07T20:33:34.0256553Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:34.0256838Z 2025-05-07T20:33:34.0256917Z @given( 2025-05-07T20:33:34.0257143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.0257449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.0257751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.0258082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.0258405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.0258693Z ) 2025-05-07T20:33:34.0259048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.0259492Z def test_silu_mul_quant( 2025-05-07T20:33:34.0259735Z self, 2025-05-07T20:33:34.0259938Z T: int, 2025-05-07T20:33:34.0260134Z D: int, 2025-05-07T20:33:34.0260347Z scale_ub: Optional[float], 2025-05-07T20:33:34.0260688Z contiguous: bool, 2025-05-07T20:33:34.0260928Z compiled: bool, 2025-05-07T20:33:34.0261149Z ) -> None: 2025-05-07T20:33:34.0261363Z torch.manual_seed(2025) 2025-05-07T20:33:34.0261604Z 2025-05-07T20:33:34.0261865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.0262209Z 2025-05-07T20:33:34.0262401Z x_sign = torch.sign(x) 2025-05-07T20:33:34.0262684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.0262996Z x = x_sign * x_clamp 2025-05-07T20:33:34.0263236Z x0 = x[:, :D] 2025-05-07T20:33:34.0263451Z x1 = x[:, D:] 2025-05-07T20:33:34.0263661Z 2025-05-07T20:33:34.0263853Z if contiguous: 2025-05-07T20:33:34.0264080Z x0 = x0.contiguous() 2025-05-07T20:33:34.0264335Z x1 = x1.contiguous() 2025-05-07T20:33:34.0264576Z 2025-05-07T20:33:34.0264761Z if scale_ub is not None: 2025-05-07T20:33:34.0265032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.0265364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.0265779Z ) 2025-05-07T20:33:34.0265975Z else: 2025-05-07T20:33:34.0266184Z scale_ub_tensor = None 2025-05-07T20:33:34.0266434Z 2025-05-07T20:33:34.0266657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.0266966Z op = silu_mul_quant 2025-05-07T20:33:34.0267217Z if compiled: 2025-05-07T20:33:34.0267523Z op = torch.compile(op) 2025-05-07T20:33:34.0267817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0268093Z 2025-05-07T20:33:34.0268282Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.0268447Z 2025-05-07T20:33:34.0268545Z moe/activation_test.py:117: 2025-05-07T20:33:34.0268842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0269216Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.0269495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0270069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.0270634Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.0271286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.0271963Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.0272500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.0273169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.0273821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.0274355Z kernel = self.compile( 2025-05-07T20:33:34.0274912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.0275583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.0275975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0276207Z 2025-05-07T20:33:34.0276409Z self = 2025-05-07T20:33:34.0277484Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.0278836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca6f7600>} 2025-05-07T20:33:34.0280249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.0281256Z context = 2025-05-07T20:33:34.0281540Z 2025-05-07T20:33:34.0281709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.0282226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.0282686Z module_map=module_map) 2025-05-07T20:33:34.0283047Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.0283412Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.0283662Z E ^ 2025-05-07T20:33:34.0290045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.0290492Z 2025-05-07T20:33:34.0290922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.0291427Z 2025-05-07T20:33:34.0291527Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.0292044Z self=, 2025-05-07T20:33:34.0292449Z T=2048, 2025-05-07T20:33:34.0292640Z D=7168, 2025-05-07T20:33:34.0292827Z scale_ub=1200.0, 2025-05-07T20:33:34.0293056Z contiguous=False, 2025-05-07T20:33:34.0293277Z compiled=True, 2025-05-07T20:33:34.2145433Z ) 2025-05-07T20:33:34.2145759Z self = 2025-05-07T20:33:34.2146290Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:34.2146650Z 2025-05-07T20:33:34.2146731Z @given( 2025-05-07T20:33:34.2146963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.2147453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.2147754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.2148074Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.2148393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.2148712Z ) 2025-05-07T20:33:34.2149054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.2149501Z def test_silu_mul_quant( 2025-05-07T20:33:34.2149739Z self, 2025-05-07T20:33:34.2149929Z T: int, 2025-05-07T20:33:34.2150125Z D: int, 2025-05-07T20:33:34.2150336Z scale_ub: Optional[float], 2025-05-07T20:33:34.2150603Z contiguous: bool, 2025-05-07T20:33:34.2150835Z compiled: bool, 2025-05-07T20:33:34.2151047Z ) -> None: 2025-05-07T20:33:34.2151252Z torch.manual_seed(2025) 2025-05-07T20:33:34.2151486Z 2025-05-07T20:33:34.2151745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.2152086Z 2025-05-07T20:33:34.2152273Z x_sign = torch.sign(x) 2025-05-07T20:33:34.2152554Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.2152859Z x = x_sign * x_clamp 2025-05-07T20:33:34.2153103Z x0 = x[:, :D] 2025-05-07T20:33:34.2153314Z x1 = x[:, D:] 2025-05-07T20:33:34.2153508Z 2025-05-07T20:33:34.2153684Z if contiguous: 2025-05-07T20:33:34.2153907Z x0 = x0.contiguous() 2025-05-07T20:33:34.2154153Z x1 = x1.contiguous() 2025-05-07T20:33:34.2154383Z 2025-05-07T20:33:34.2154561Z if scale_ub is not None: 2025-05-07T20:33:34.2154817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.2155143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.2155442Z ) 2025-05-07T20:33:34.2155623Z else: 2025-05-07T20:33:34.2155822Z scale_ub_tensor = None 2025-05-07T20:33:34.2156066Z 2025-05-07T20:33:34.2156288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.2156590Z op = silu_mul_quant 2025-05-07T20:33:34.2156833Z if compiled: 2025-05-07T20:33:34.2157143Z op = torch.compile(op) 2025-05-07T20:33:34.2157442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2157708Z 2025-05-07T20:33:34.2157887Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.2158049Z 2025-05-07T20:33:34.2158145Z moe/activation_test.py:117: 2025-05-07T20:33:34.2158435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2158754Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.2159023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2159585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.2160129Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.2160768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.2161437Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.2162026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.2162765Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.2163406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.2163919Z kernel = self.compile( 2025-05-07T20:33:34.2164450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.2165086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.2165472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2165694Z 2025-05-07T20:33:34.2165894Z self = 2025-05-07T20:33:34.2167002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.2168374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca038720>} 2025-05-07T20:33:34.2169719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.2170720Z context = 2025-05-07T20:33:34.2171001Z 2025-05-07T20:33:34.2171161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.2171671Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.2172134Z module_map=module_map) 2025-05-07T20:33:34.2172486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.2172838Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.2173085Z E ^ 2025-05-07T20:33:34.2173540Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.2173987Z 2025-05-07T20:33:34.2174399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.2174894Z 2025-05-07T20:33:34.2174997Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.2175389Z self=, 2025-05-07T20:33:34.2175777Z T=1, 2025-05-07T20:33:34.2175950Z D=5120, 2025-05-07T20:33:34.2176135Z scale_ub=None, 2025-05-07T20:33:34.2176342Z contiguous=False, 2025-05-07T20:33:34.2176564Z compiled=False, 2025-05-07T20:33:34.2176767Z ) 2025-05-07T20:33:34.2177126Z self = 2025-05-07T20:33:34.2177611Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:34.2177863Z 2025-05-07T20:33:34.2177945Z @given( 2025-05-07T20:33:34.2178162Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.2178469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.2178766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.2179080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.2179400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.2179677Z ) 2025-05-07T20:33:34.2180044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.2180496Z def test_silu_mul_quant( 2025-05-07T20:33:34.2180735Z self, 2025-05-07T20:33:34.2180929Z T: int, 2025-05-07T20:33:34.2181118Z D: int, 2025-05-07T20:33:34.2181330Z scale_ub: Optional[float], 2025-05-07T20:33:34.2181590Z contiguous: bool, 2025-05-07T20:33:34.2181901Z compiled: bool, 2025-05-07T20:33:34.2182122Z ) -> None: 2025-05-07T20:33:34.2182333Z torch.manual_seed(2025) 2025-05-07T20:33:34.2182562Z 2025-05-07T20:33:34.2182819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.2183148Z 2025-05-07T20:33:34.2183330Z x_sign = torch.sign(x) 2025-05-07T20:33:34.2183613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.2183912Z x = x_sign * x_clamp 2025-05-07T20:33:34.2184139Z x0 = x[:, :D] 2025-05-07T20:33:34.2184345Z x1 = x[:, D:] 2025-05-07T20:33:34.2184545Z 2025-05-07T20:33:34.2184719Z if contiguous: 2025-05-07T20:33:34.2184944Z x0 = x0.contiguous() 2025-05-07T20:33:34.2185245Z x1 = x1.contiguous() 2025-05-07T20:33:34.2185474Z 2025-05-07T20:33:34.2185652Z if scale_ub is not None: 2025-05-07T20:33:34.2185919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.2186249Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.2186547Z ) 2025-05-07T20:33:34.2186735Z else: 2025-05-07T20:33:34.2186938Z scale_ub_tensor = None 2025-05-07T20:33:34.2187179Z 2025-05-07T20:33:34.2187462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.2187763Z op = silu_mul_quant 2025-05-07T20:33:34.2188000Z if compiled: 2025-05-07T20:33:34.2188238Z op = torch.compile(op) 2025-05-07T20:33:34.2188524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2188786Z 2025-05-07T20:33:34.2188970Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.2189129Z 2025-05-07T20:33:34.2189228Z moe/activation_test.py:117: 2025-05-07T20:33:34.2189516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2189834Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.2190127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2190847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.2191514Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.2192054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.2192718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.2193362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.2193873Z kernel = self.compile( 2025-05-07T20:33:34.2194418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.2195054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.2195511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2195741Z 2025-05-07T20:33:34.2195940Z self = 2025-05-07T20:33:34.2196993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.2198330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca039120>} 2025-05-07T20:33:34.2199638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.2200638Z context = 2025-05-07T20:33:34.2200920Z 2025-05-07T20:33:34.2201158Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.2201665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.2202121Z module_map=module_map) 2025-05-07T20:33:34.2202467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.2202809Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.2203058Z E ^ 2025-05-07T20:33:34.2203503Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.2203960Z 2025-05-07T20:33:34.2204375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.2204926Z 2025-05-07T20:33:34.2205027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.2205434Z self=, 2025-05-07T20:33:34.2205830Z T=4096, 2025-05-07T20:33:34.2206021Z D=7168, 2025-05-07T20:33:34.2206214Z scale_ub=1200.0, 2025-05-07T20:33:34.2206432Z contiguous=False, 2025-05-07T20:33:34.2206658Z compiled=False, 2025-05-07T20:33:34.2206863Z ) 2025-05-07T20:33:34.2207181Z self = 2025-05-07T20:33:34.2207667Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:34.2207935Z 2025-05-07T20:33:34.2208020Z @given( 2025-05-07T20:33:34.2208239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.2208544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.2208846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.2209172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.2209489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.2209772Z ) 2025-05-07T20:33:34.2210121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.2210551Z def test_silu_mul_quant( 2025-05-07T20:33:34.2210796Z self, 2025-05-07T20:33:34.2210990Z T: int, 2025-05-07T20:33:34.2211181Z D: int, 2025-05-07T20:33:34.2211405Z scale_ub: Optional[float], 2025-05-07T20:33:34.2211674Z contiguous: bool, 2025-05-07T20:33:34.2211907Z compiled: bool, 2025-05-07T20:33:34.2212134Z ) -> None: 2025-05-07T20:33:34.2212343Z torch.manual_seed(2025) 2025-05-07T20:33:34.2212578Z 2025-05-07T20:33:34.2212853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.2213191Z 2025-05-07T20:33:34.2213384Z x_sign = torch.sign(x) 2025-05-07T20:33:34.2213669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.2213979Z x = x_sign * x_clamp 2025-05-07T20:33:34.2214214Z x0 = x[:, :D] 2025-05-07T20:33:34.2214493Z x1 = x[:, D:] 2025-05-07T20:33:34.2214705Z 2025-05-07T20:33:34.2214891Z if contiguous: 2025-05-07T20:33:34.2215110Z x0 = x0.contiguous() 2025-05-07T20:33:34.2215363Z x1 = x1.contiguous() 2025-05-07T20:33:34.2215599Z 2025-05-07T20:33:34.2215789Z if scale_ub is not None: 2025-05-07T20:33:34.2216063Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.2216393Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.2216689Z ) 2025-05-07T20:33:34.2216884Z else: 2025-05-07T20:33:34.2217093Z scale_ub_tensor = None 2025-05-07T20:33:34.2217335Z 2025-05-07T20:33:34.2217564Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.2217874Z op = silu_mul_quant 2025-05-07T20:33:34.2218132Z if compiled: 2025-05-07T20:33:34.2218375Z op = torch.compile(op) 2025-05-07T20:33:34.2218666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2218944Z 2025-05-07T20:33:34.2219217Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.2219385Z 2025-05-07T20:33:34.2219482Z moe/activation_test.py:117: 2025-05-07T20:33:34.2219772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2220093Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.2220375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2221059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:34.2221733Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.2222280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.2222944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.2223642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.2224158Z kernel = self.compile( 2025-05-07T20:33:34.2224696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.2225334Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.2225728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2225947Z 2025-05-07T20:33:34.2226148Z self = 2025-05-07T20:33:34.2227197Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.2228595Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca03a480>} 2025-05-07T20:33:34.2229953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.2231005Z context = 2025-05-07T20:33:34.2231284Z 2025-05-07T20:33:34.2231446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.2231964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.2232435Z module_map=module_map) 2025-05-07T20:33:34.2232792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.2233141Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.2233398Z E ^ 2025-05-07T20:33:34.2233902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.2234351Z 2025-05-07T20:33:34.2234767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.3780685Z 2025-05-07T20:33:34.3780890Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.3781315Z self=, 2025-05-07T20:33:34.3781759Z T=16384, 2025-05-07T20:33:34.3781953Z D=7168, 2025-05-07T20:33:34.3782138Z scale_ub=None, 2025-05-07T20:33:34.3782349Z contiguous=True, 2025-05-07T20:33:34.3782568Z compiled=True, 2025-05-07T20:33:34.3782765Z ) 2025-05-07T20:33:34.3783076Z self = 2025-05-07T20:33:34.3783555Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:34.3783821Z 2025-05-07T20:33:34.3783905Z @given( 2025-05-07T20:33:34.3784126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.3784432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.3784882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.3785204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.3785522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.3785804Z ) 2025-05-07T20:33:34.3786137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.3786583Z def test_silu_mul_quant( 2025-05-07T20:33:34.3786820Z self, 2025-05-07T20:33:34.3787010Z T: int, 2025-05-07T20:33:34.3787200Z D: int, 2025-05-07T20:33:34.3787499Z scale_ub: Optional[float], 2025-05-07T20:33:34.3787771Z contiguous: bool, 2025-05-07T20:33:34.3787997Z compiled: bool, 2025-05-07T20:33:34.3788291Z ) -> None: 2025-05-07T20:33:34.3788497Z torch.manual_seed(2025) 2025-05-07T20:33:34.3788727Z 2025-05-07T20:33:34.3788991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.3789331Z 2025-05-07T20:33:34.3789516Z x_sign = torch.sign(x) 2025-05-07T20:33:34.3789801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.3790112Z x = x_sign * x_clamp 2025-05-07T20:33:34.3790346Z x0 = x[:, :D] 2025-05-07T20:33:34.3790559Z x1 = x[:, D:] 2025-05-07T20:33:34.3790759Z 2025-05-07T20:33:34.3790935Z if contiguous: 2025-05-07T20:33:34.3791161Z x0 = x0.contiguous() 2025-05-07T20:33:34.3791414Z x1 = x1.contiguous() 2025-05-07T20:33:34.3791646Z 2025-05-07T20:33:34.3791829Z if scale_ub is not None: 2025-05-07T20:33:34.3792099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.3792431Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.3792730Z ) 2025-05-07T20:33:34.3792923Z else: 2025-05-07T20:33:34.3793129Z scale_ub_tensor = None 2025-05-07T20:33:34.3793377Z 2025-05-07T20:33:34.3793607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.3793923Z op = silu_mul_quant 2025-05-07T20:33:34.3794162Z if compiled: 2025-05-07T20:33:34.3794411Z op = torch.compile(op) 2025-05-07T20:33:34.3794699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3794966Z 2025-05-07T20:33:34.3795159Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.3795318Z 2025-05-07T20:33:34.3795418Z moe/activation_test.py:117: 2025-05-07T20:33:34.3795704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3796026Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.3796299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3796855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.3797399Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.3798120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.3798794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.3799333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.3799994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.3800644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.3801168Z kernel = self.compile( 2025-05-07T20:33:34.3801712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.3802353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.3802752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3802977Z 2025-05-07T20:33:34.3803187Z self = 2025-05-07T20:33:34.3804359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.3805704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca03b740>} 2025-05-07T20:33:34.3807012Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.3808078Z context = 2025-05-07T20:33:34.3808357Z 2025-05-07T20:33:34.3808522Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.3809032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.3809496Z module_map=module_map) 2025-05-07T20:33:34.3809856Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.3810195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.3810462Z E ^ 2025-05-07T20:33:34.3810929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.3811388Z 2025-05-07T20:33:34.3811806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.3812303Z 2025-05-07T20:33:34.3812404Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.3812810Z self=, 2025-05-07T20:33:34.3813197Z T=4096, 2025-05-07T20:33:34.3813376Z D=5120, 2025-05-07T20:33:34.3813564Z scale_ub=None, 2025-05-07T20:33:34.3813780Z contiguous=False, 2025-05-07T20:33:34.3814000Z compiled=True, 2025-05-07T20:33:34.3814200Z ) 2025-05-07T20:33:34.3814514Z self = 2025-05-07T20:33:34.3814990Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:34.3815254Z 2025-05-07T20:33:34.3815332Z @given( 2025-05-07T20:33:34.3815556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.3815855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.3816148Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.3816468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.3816786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.3817071Z ) 2025-05-07T20:33:34.3817419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.3817894Z def test_silu_mul_quant( 2025-05-07T20:33:34.3818136Z self, 2025-05-07T20:33:34.3818329Z T: int, 2025-05-07T20:33:34.3818517Z D: int, 2025-05-07T20:33:34.3818730Z scale_ub: Optional[float], 2025-05-07T20:33:34.3818988Z contiguous: bool, 2025-05-07T20:33:34.3819225Z compiled: bool, 2025-05-07T20:33:34.3819441Z ) -> None: 2025-05-07T20:33:34.3819647Z torch.manual_seed(2025) 2025-05-07T20:33:34.3819883Z 2025-05-07T20:33:34.3820143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.3820474Z 2025-05-07T20:33:34.3820659Z x_sign = torch.sign(x) 2025-05-07T20:33:34.3820943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.3821241Z x = x_sign * x_clamp 2025-05-07T20:33:34.3821479Z x0 = x[:, :D] 2025-05-07T20:33:34.3821698Z x1 = x[:, D:] 2025-05-07T20:33:34.3821896Z 2025-05-07T20:33:34.3822076Z if contiguous: 2025-05-07T20:33:34.3822306Z x0 = x0.contiguous() 2025-05-07T20:33:34.3822638Z x1 = x1.contiguous() 2025-05-07T20:33:34.3822871Z 2025-05-07T20:33:34.3823059Z if scale_ub is not None: 2025-05-07T20:33:34.3823317Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.3823643Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.3823943Z ) 2025-05-07T20:33:34.3824129Z else: 2025-05-07T20:33:34.3824327Z scale_ub_tensor = None 2025-05-07T20:33:34.3824580Z 2025-05-07T20:33:34.3830211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.3830560Z op = silu_mul_quant 2025-05-07T20:33:34.3830805Z if compiled: 2025-05-07T20:33:34.3831052Z op = torch.compile(op) 2025-05-07T20:33:34.3831416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3831683Z 2025-05-07T20:33:34.3831871Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.3832038Z 2025-05-07T20:33:34.3832137Z moe/activation_test.py:117: 2025-05-07T20:33:34.3832434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3832762Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.3833037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3833593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.3834141Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.3834793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.3835458Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.3835988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.3836660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.3837311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.3837831Z kernel = self.compile( 2025-05-07T20:33:34.3838378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.3839024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.3839404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3839626Z 2025-05-07T20:33:34.3839828Z self = 2025-05-07T20:33:34.3841191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.3842628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca254c20>} 2025-05-07T20:33:34.3844293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.3845533Z context = 2025-05-07T20:33:34.3845871Z 2025-05-07T20:33:34.3846053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.3846658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.3847197Z module_map=module_map) 2025-05-07T20:33:34.3847594Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.3847990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.3848271Z E ^ 2025-05-07T20:33:34.3848804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.3849378Z 2025-05-07T20:33:34.3849808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.5209480Z 2025-05-07T20:33:34.5209959Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.5210369Z self=, 2025-05-07T20:33:34.5210780Z T=4096, 2025-05-07T20:33:34.5210968Z D=5120, 2025-05-07T20:33:34.5211153Z scale_ub=1200.0, 2025-05-07T20:33:34.5211377Z contiguous=False, 2025-05-07T20:33:34.5211607Z compiled=False, 2025-05-07T20:33:34.5211805Z ) 2025-05-07T20:33:34.5212121Z self = 2025-05-07T20:33:34.5212706Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:34.5212983Z 2025-05-07T20:33:34.5213068Z @given( 2025-05-07T20:33:34.5213292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.5213617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.5213933Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.5214253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.5214572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.5214855Z ) 2025-05-07T20:33:34.5215203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.5215649Z def test_silu_mul_quant( 2025-05-07T20:33:34.5215884Z self, 2025-05-07T20:33:34.5216070Z T: int, 2025-05-07T20:33:34.5216261Z D: int, 2025-05-07T20:33:34.5216481Z scale_ub: Optional[float], 2025-05-07T20:33:34.5216749Z contiguous: bool, 2025-05-07T20:33:34.5216987Z compiled: bool, 2025-05-07T20:33:34.5217210Z ) -> None: 2025-05-07T20:33:34.5217418Z torch.manual_seed(2025) 2025-05-07T20:33:34.5217654Z 2025-05-07T20:33:34.5217922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.5218258Z 2025-05-07T20:33:34.5218441Z x_sign = torch.sign(x) 2025-05-07T20:33:34.5218728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.5219062Z x = x_sign * x_clamp 2025-05-07T20:33:34.5219296Z x0 = x[:, :D] 2025-05-07T20:33:34.5219505Z x1 = x[:, D:] 2025-05-07T20:33:34.5219703Z 2025-05-07T20:33:34.5219882Z if contiguous: 2025-05-07T20:33:34.5220111Z x0 = x0.contiguous() 2025-05-07T20:33:34.5220356Z x1 = x1.contiguous() 2025-05-07T20:33:34.5220593Z 2025-05-07T20:33:34.5220777Z if scale_ub is not None: 2025-05-07T20:33:34.5221038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.5221372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.5221683Z ) 2025-05-07T20:33:34.5221873Z else: 2025-05-07T20:33:34.5222147Z scale_ub_tensor = None 2025-05-07T20:33:34.5222398Z 2025-05-07T20:33:34.5222620Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.5222922Z op = silu_mul_quant 2025-05-07T20:33:34.5223166Z if compiled: 2025-05-07T20:33:34.5223409Z op = torch.compile(op) 2025-05-07T20:33:34.5223697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.5223968Z 2025-05-07T20:33:34.5224158Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.5224322Z 2025-05-07T20:33:34.5224419Z moe/activation_test.py:117: 2025-05-07T20:33:34.5224710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.5225038Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.5225306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.5225988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:34.5209959Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:34.5210369Z     self=<...>,
2025-05-07T20:33:34.5210780Z     T=4096,
2025-05-07T20:33:34.5210968Z     D=5120,
2025-05-07T20:33:34.5211153Z     scale_ub=1200.0,
2025-05-07T20:33:34.5211377Z     contiguous=False,
2025-05-07T20:33:34.5211607Z     compiled=False,
2025-05-07T20:33:34.5211805Z )
2025-05-07T20:33:34.5212121Z self = <...>
2025-05-07T20:33:34.5212706Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:33:34.5212983Z 
2025-05-07T20:33:34.5213068Z     @given(
2025-05-07T20:33:34.5213292Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:34.5213617Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:34.5213933Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:34.5214253Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:34.5214572Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:34.5214855Z     )
2025-05-07T20:33:34.5215203Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:34.5215649Z     def test_silu_mul_quant(
2025-05-07T20:33:34.5215884Z         self,
2025-05-07T20:33:34.5216070Z         T: int,
2025-05-07T20:33:34.5216261Z         D: int,
2025-05-07T20:33:34.5216481Z         scale_ub: Optional[float],
2025-05-07T20:33:34.5216749Z         contiguous: bool,
2025-05-07T20:33:34.5216987Z         compiled: bool,
2025-05-07T20:33:34.5217210Z     ) -> None:
2025-05-07T20:33:34.5217418Z         torch.manual_seed(2025)
2025-05-07T20:33:34.5217654Z 
2025-05-07T20:33:34.5217922Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:34.5218258Z 
2025-05-07T20:33:34.5218441Z         x_sign = torch.sign(x)
2025-05-07T20:33:34.5218728Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:34.5219062Z         x = x_sign * x_clamp
2025-05-07T20:33:34.5219296Z         x0 = x[:, :D]
2025-05-07T20:33:34.5219505Z         x1 = x[:, D:]
2025-05-07T20:33:34.5219703Z 
2025-05-07T20:33:34.5219882Z         if contiguous:
2025-05-07T20:33:34.5220111Z             x0 = x0.contiguous()
2025-05-07T20:33:34.5220356Z             x1 = x1.contiguous()
2025-05-07T20:33:34.5220593Z 
2025-05-07T20:33:34.5220777Z         if scale_ub is not None:
2025-05-07T20:33:34.5221038Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:34.5221372Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:34.5221683Z             )
2025-05-07T20:33:34.5221873Z         else:
2025-05-07T20:33:34.5222147Z             scale_ub_tensor = None
2025-05-07T20:33:34.5222398Z 
2025-05-07T20:33:34.5222620Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:34.5222922Z             op = silu_mul_quant
2025-05-07T20:33:34.5223166Z             if compiled:
2025-05-07T20:33:34.5223409Z                 op = torch.compile(op)
2025-05-07T20:33:34.5223697Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:34.5223968Z 
2025-05-07T20:33:34.5224158Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:34.5224322Z 
2025-05-07T20:33:34.5224419Z moe/activation_test.py:117:
2025-05-07T20:33:34.5224710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:34.5225038Z moe/activation_test.py:115: in fn
2025-05-07T20:33:34.5225306Z     return op(x0, x1, scale_ub_tensor)
[... Triton compile traceback identical to the previous example ...]
2025-05-07T20:33:34.5237868Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:34.5238204Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:34.5238457Z E   ^
2025-05-07T20:33:34.5238915Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:34.5239783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:34.5240727Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.5272606Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.7260999Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.7291334Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.8661521Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.8695235Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError: fp8e4nv not supported in this architecture
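Because every sampled (T, D, scale_ub, contiguous, compiled) combination fails with the same error, the failure is a property of the hardware, not of the inputs, so each additional example adds no information. A hedged sketch of how a test like this could skip cleanly on unsupported GPUs instead of failing once per example; the class and test names below are illustrative, not the actual fbgemm_gpu test code:

    import unittest

    import torch

    def lacks_fp8e4nv_support() -> bool:
        # True when the current CUDA device cannot compile Triton fp8e4nv
        # kernels; those require compute capability 8.9+ (Ada/Hopper).
        return (
            not torch.cuda.is_available()
            or torch.cuda.get_device_capability() < (8, 9)
        )

    class ActivationTestsSketch(unittest.TestCase):
        @unittest.skipIf(
            lacks_fp8e4nv_support(),
            "fp8e4nv (float8_e4m3fn) needs SM 8.9+; skipping on this GPU",
        )
        def test_silu_mul_quant_smoke(self) -> None:
            # Placeholder body: the Hypothesis-driven test shown in the log
            # above would run here only on fp8-capable hardware.
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

    if __name__ == "__main__":
        unittest.main()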
2025-05-07T20:33:34.8712433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.8713104Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.8713634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.8714300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.8714999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.8715521Z kernel = self.compile( 2025-05-07T20:33:34.8716062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.8716696Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.8717076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.8717301Z 2025-05-07T20:33:34.8717498Z self = 2025-05-07T20:33:34.8718548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.8719905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca1842c0>} 2025-05-07T20:33:34.8721337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.8722327Z context = 2025-05-07T20:33:34.8722616Z 2025-05-07T20:33:34.8722777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.8723298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.8723768Z module_map=module_map) 2025-05-07T20:33:34.8724128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.8724479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.8724785Z E ^ 2025-05-07T20:33:34.8725257Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.8725704Z 2025-05-07T20:33:34.8726137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.8726647Z 2025-05-07T20:33:34.8726748Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.8727150Z self=, 2025-05-07T20:33:34.8727537Z T=4096, 2025-05-07T20:33:34.8727731Z D=7168, 2025-05-07T20:33:34.8727924Z scale_ub=None, 2025-05-07T20:33:34.8728138Z contiguous=False, 2025-05-07T20:33:34.8728366Z compiled=True, 2025-05-07T20:33:35.2843886Z ) 2025-05-07T20:33:35.2844464Z self = 2025-05-07T20:33:35.2845159Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:35.2845551Z 2025-05-07T20:33:35.2845666Z @given( 2025-05-07T20:33:35.2845951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.2846268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.2846587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.2846917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.2847237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.2847522Z ) 2025-05-07T20:33:35.2847867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.2848322Z def test_silu_mul_quant( 2025-05-07T20:33:35.2848565Z self, 2025-05-07T20:33:35.2848761Z T: int, 2025-05-07T20:33:35.2848952Z D: int, 2025-05-07T20:33:35.2849178Z scale_ub: Optional[float], 2025-05-07T20:33:35.2849454Z contiguous: bool, 2025-05-07T20:33:35.2849691Z compiled: bool, 2025-05-07T20:33:35.2849928Z ) -> None: 2025-05-07T20:33:35.2850148Z torch.manual_seed(2025) 2025-05-07T20:33:35.2850386Z 2025-05-07T20:33:35.2850945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.2851298Z 2025-05-07T20:33:35.2851493Z x_sign = torch.sign(x) 2025-05-07T20:33:35.2851790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.2852104Z x = x_sign * x_clamp 2025-05-07T20:33:35.2852352Z x0 = x[:, :D] 2025-05-07T20:33:35.2852561Z x1 = x[:, D:] 2025-05-07T20:33:35.2852774Z 2025-05-07T20:33:35.2852970Z if contiguous: 2025-05-07T20:33:35.2853196Z x0 = x0.contiguous() 2025-05-07T20:33:35.2853460Z x1 = x1.contiguous() 2025-05-07T20:33:35.2853703Z 2025-05-07T20:33:35.2853889Z if scale_ub is not None: 2025-05-07T20:33:35.2854170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.2854515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.2854833Z ) 2025-05-07T20:33:35.2855034Z else: 2025-05-07T20:33:35.2855254Z scale_ub_tensor = None 2025-05-07T20:33:35.2855501Z 2025-05-07T20:33:35.2855835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.2856235Z op = silu_mul_quant 2025-05-07T20:33:35.2856484Z if compiled: 2025-05-07T20:33:35.2856741Z op = torch.compile(op) 2025-05-07T20:33:35.2857040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2857318Z 2025-05-07T20:33:35.2857502Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.2857675Z 2025-05-07T20:33:35.2857776Z moe/activation_test.py:117: 2025-05-07T20:33:35.2858076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2858399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.2858687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2859267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:35.2859949Z return fn(*args, **kwargs) 
2025-05-07T20:33:35.2860630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:35.2861312Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.2861851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.2862531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.2863197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.2863725Z kernel = self.compile( 2025-05-07T20:33:35.2864278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.2864921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.2865322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2865545Z 2025-05-07T20:33:35.2865765Z self = 2025-05-07T20:33:35.2866837Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.2868373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca184d60>} 2025-05-07T20:33:35.2869758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.2870774Z context = 2025-05-07T20:33:35.2871057Z 2025-05-07T20:33:35.2871286Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.2871802Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.2872274Z module_map=module_map) 2025-05-07T20:33:35.2872644Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.2872997Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.2873248Z E ^ 2025-05-07T20:33:35.2873710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:35.2874172Z 2025-05-07T20:33:35.2874613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:35.2875154Z 2025-05-07T20:33:35.2875259Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:35.2875675Z self=, 2025-05-07T20:33:35.2876076Z T=16384, 2025-05-07T20:33:35.2876279Z D=5120, 2025-05-07T20:33:35.2876468Z scale_ub=1200.0, 2025-05-07T20:33:35.2876790Z contiguous=False, 2025-05-07T20:33:35.2877021Z compiled=False, 2025-05-07T20:33:35.2877223Z ) 2025-05-07T20:33:35.2877548Z self = 2025-05-07T20:33:35.2878046Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:35.2878321Z 2025-05-07T20:33:35.2878400Z @given( 2025-05-07T20:33:35.2878632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.2878953Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.2879251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.2879580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.2879937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.2880298Z ) 2025-05-07T20:33:35.2880646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.2881089Z def test_silu_mul_quant( 2025-05-07T20:33:35.2881339Z self, 2025-05-07T20:33:35.2881569Z T: int, 2025-05-07T20:33:35.2881768Z D: int, 2025-05-07T20:33:35.2881979Z scale_ub: Optional[float], 2025-05-07T20:33:35.2882251Z contiguous: bool, 2025-05-07T20:33:35.2882493Z compiled: bool, 2025-05-07T20:33:35.2882716Z ) -> None: 2025-05-07T20:33:35.2882929Z torch.manual_seed(2025) 2025-05-07T20:33:35.2883170Z 2025-05-07T20:33:35.2883439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.2883783Z 2025-05-07T20:33:35.2883980Z x_sign = torch.sign(x) 2025-05-07T20:33:35.2884265Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.2884578Z x = x_sign * x_clamp 2025-05-07T20:33:35.2884822Z x0 = x[:, :D] 2025-05-07T20:33:35.2885030Z x1 = x[:, D:] 2025-05-07T20:33:35.2885238Z 2025-05-07T20:33:35.2885447Z if contiguous: 2025-05-07T20:33:35.2885687Z x0 = x0.contiguous() 2025-05-07T20:33:35.2885943Z x1 = x1.contiguous() 2025-05-07T20:33:35.2886182Z 2025-05-07T20:33:35.2886380Z if scale_ub is not None: 2025-05-07T20:33:35.2894433Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.2894787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.2895112Z ) 2025-05-07T20:33:35.2895303Z else: 2025-05-07T20:33:35.2895513Z scale_ub_tensor = None 2025-05-07T20:33:35.2895766Z 2025-05-07T20:33:35.2895994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.2896312Z op = silu_mul_quant 2025-05-07T20:33:35.2896565Z if compiled: 2025-05-07T20:33:35.2896808Z op = torch.compile(op) 2025-05-07T20:33:35.2897113Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2897395Z 2025-05-07T20:33:35.2897586Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.2897837Z 2025-05-07T20:33:35.2897939Z moe/activation_test.py:117: 2025-05-07T20:33:35.2898243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2898582Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.2898861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2899558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:35.2900246Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.2900793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.2901477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.2902145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.2902674Z kernel = self.compile( 2025-05-07T20:33:35.2903281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.2903972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.2904373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2904597Z 2025-05-07T20:33:35.2904802Z self = 2025-05-07T20:33:35.2905873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.2907242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca185c60>} 2025-05-07T20:33:35.2908753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.2909821Z context = 2025-05-07T20:33:35.2910145Z 2025-05-07T20:33:35.2910311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.2910825Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.2911289Z module_map=module_map) 2025-05-07T20:33:35.2911659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.2912002Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.2912259Z E ^ 2025-05-07T20:33:35.2912721Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:35.2913178Z 2025-05-07T20:33:35.2913591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:35.2914108Z 2025-05-07T20:33:35.2914209Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:35.2914614Z self=, 2025-05-07T20:33:35.2915019Z T=16384, 2025-05-07T20:33:35.2915204Z D=5120, 2025-05-07T20:33:35.2915401Z scale_ub=1200.0, 2025-05-07T20:33:35.2915626Z contiguous=True, 2025-05-07T20:33:35.2915839Z compiled=True, 2025-05-07T20:33:35.2916047Z ) 2025-05-07T20:33:35.2916374Z self = 2025-05-07T20:33:35.2916856Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:35.2917134Z 2025-05-07T20:33:35.2917213Z @given( 2025-05-07T20:33:35.2917447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.2917762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.2918114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.2918449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.2918772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.2919053Z ) 2025-05-07T20:33:35.2919400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.2919852Z def test_silu_mul_quant( 2025-05-07T20:33:35.2920090Z self, 2025-05-07T20:33:35.2920294Z T: int, 2025-05-07T20:33:35.2920494Z D: int, 2025-05-07T20:33:35.2920711Z scale_ub: Optional[float], 2025-05-07T20:33:35.2920986Z contiguous: bool, 2025-05-07T20:33:35.2921234Z compiled: bool, 2025-05-07T20:33:35.2921454Z ) -> None: 2025-05-07T20:33:35.2921672Z torch.manual_seed(2025) 2025-05-07T20:33:35.2921919Z 2025-05-07T20:33:35.2922189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.2922530Z 2025-05-07T20:33:35.2922729Z x_sign = torch.sign(x) 2025-05-07T20:33:35.2923103Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.2923452Z x = x_sign * x_clamp 2025-05-07T20:33:35.2923697Z x0 = x[:, :D] 2025-05-07T20:33:35.2923917Z x1 = x[:, D:] 2025-05-07T20:33:35.2924121Z 2025-05-07T20:33:35.2924308Z if contiguous: 2025-05-07T20:33:35.2924545Z x0 = x0.contiguous() 2025-05-07T20:33:35.2924796Z x1 = x1.contiguous() 2025-05-07T20:33:35.2925041Z 2025-05-07T20:33:35.2925233Z if scale_ub is not None: 2025-05-07T20:33:35.2925504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.2925840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.2926141Z ) 2025-05-07T20:33:35.2926330Z else: 2025-05-07T20:33:35.2926591Z scale_ub_tensor = None 2025-05-07T20:33:35.2926843Z 2025-05-07T20:33:35.2927063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.2927375Z op = silu_mul_quant 2025-05-07T20:33:35.2927634Z if compiled: 2025-05-07T20:33:35.2927883Z op = torch.compile(op) 2025-05-07T20:33:35.2928171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2928445Z 2025-05-07T20:33:35.2928638Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.2928801Z 2025-05-07T20:33:35.2928899Z moe/activation_test.py:117: 2025-05-07T20:33:35.2929195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2929530Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.2929805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2930387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:35.2930949Z return fn(*args, **kwargs) 
2025-05-07T20:33:35.2931641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:35.2932313Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.2932876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.2933552Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.2934205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.2934731Z kernel = self.compile( 2025-05-07T20:33:35.2935293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.2935943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.2936331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2936569Z 2025-05-07T20:33:35.2936773Z self = 2025-05-07T20:33:35.2937891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.2939259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca187380>} 2025-05-07T20:33:35.2940922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.2941936Z context = 2025-05-07T20:33:35.2942232Z 2025-05-07T20:33:35.2942396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.2942927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.2943563Z module_map=module_map) 2025-05-07T20:33:35.2943934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.2944291Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.2944557Z E ^ 2025-05-07T20:33:35.2945015Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:35.2945479Z 2025-05-07T20:33:35.2945905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:35.4495286Z 2025-05-07T20:33:35.4495900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:35.4497113Z self=, 2025-05-07T20:33:35.4498722Z T=16384, 2025-05-07T20:33:35.4499243Z D=5120, 2025-05-07T20:33:35.4499735Z scale_ub=None, 2025-05-07T20:33:35.4500092Z contiguous=False, 2025-05-07T20:33:35.4500389Z compiled=True, 2025-05-07T20:33:35.4500637Z ) 2025-05-07T20:33:35.4500960Z self = 2025-05-07T20:33:35.4501459Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:35.4501738Z 2025-05-07T20:33:35.4501842Z @given( 2025-05-07T20:33:35.4502073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.4502390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.4502688Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.4503019Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.4503346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.4503629Z ) 2025-05-07T20:33:35.4503973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.4504419Z def test_silu_mul_quant( 2025-05-07T20:33:35.4504659Z self, 2025-05-07T20:33:35.4504853Z T: int, 2025-05-07T20:33:35.4505054Z D: int, 2025-05-07T20:33:35.4505280Z scale_ub: Optional[float], 2025-05-07T20:33:35.4505545Z contiguous: bool, 2025-05-07T20:33:35.4505787Z compiled: bool, 2025-05-07T20:33:35.4506017Z ) -> None: 2025-05-07T20:33:35.4506230Z torch.manual_seed(2025) 2025-05-07T20:33:35.4506469Z 2025-05-07T20:33:35.4506747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.4507079Z 2025-05-07T20:33:35.4507279Z x_sign = torch.sign(x) 2025-05-07T20:33:35.4507657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.4507972Z x = x_sign * x_clamp 2025-05-07T20:33:35.4508206Z x0 = x[:, :D] 2025-05-07T20:33:35.4508428Z x1 = x[:, D:] 2025-05-07T20:33:35.4508640Z 2025-05-07T20:33:35.4508819Z if contiguous: 2025-05-07T20:33:35.4509054Z x0 = x0.contiguous() 2025-05-07T20:33:35.4509314Z x1 = x1.contiguous() 2025-05-07T20:33:35.4509649Z 2025-05-07T20:33:35.4509851Z if scale_ub is not None: 2025-05-07T20:33:35.4510132Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.4510464Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.4510774Z ) 2025-05-07T20:33:35.4510972Z else: 2025-05-07T20:33:35.4511179Z scale_ub_tensor = None 2025-05-07T20:33:35.4511434Z 2025-05-07T20:33:35.4511671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.4511980Z op = silu_mul_quant 2025-05-07T20:33:35.4512231Z if compiled: 2025-05-07T20:33:35.4512487Z op = torch.compile(op) 2025-05-07T20:33:35.4512785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.4513049Z 2025-05-07T20:33:35.4513245Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.4513410Z 2025-05-07T20:33:35.4513518Z moe/activation_test.py:117: 2025-05-07T20:33:35.4513813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.4514295Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.4514578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.4515146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:35.4515718Z return fn(*args, **kwargs) 
2025-05-07T20:33:35.4516383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:35.4517058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.4517604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.4518281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.4518992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.4519515Z kernel = self.compile( 2025-05-07T20:33:35.4520072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.4520720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.4521108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.4521342Z 2025-05-07T20:33:35.4521547Z self = 2025-05-07T20:33:35.4522620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.4524007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819ea05e0>} 2025-05-07T20:33:35.4525337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.4526392Z context = 2025-05-07T20:33:35.4526684Z 2025-05-07T20:33:35.4526849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.4527369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.4527834Z module_map=module_map) 2025-05-07T20:33:35.4528197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.4528548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.4528810Z E ^ 2025-05-07T20:33:35.4529262Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis went on to try ten more examples, and every one failed with the identical ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised through the same traceback from triton/compiler/compiler.py:100:
2025-05-07T20:33:35.4530818Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:35.6155511Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:35.6195342Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:35.7904253Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:35.7939383Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:35.9134243Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:35.9166017Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:35.9208212Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:36.0849228Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:36.0882972Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
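All of these failures are environmental rather than numerical: Triton refuses to even build _fbgemm_silu_mul_quant because the kernel requests the fp8e4nv dtype (the Triton name corresponding to torch.float8_e4m3fn), and the GPU in this job only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip the test on such devices follows; the sm_89 threshold is an assumption based on NVIDIA's Ada-and-newer fp8 support, so adjust it to the backend's actual support matrix:

import unittest
import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv maps to torch.float8_e4m3fn; Triton's NVIDIA backend compiles
    # it only on sufficiently new architectures (assumed here: compute
    # capability 8.9, i.e. Ada, and newer).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, stacking with @given/@settings:
#
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...) -> None:
#     ...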
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.2129049Z 2025-05-07T20:33:36.2129523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.2130028Z 2025-05-07T20:33:36.2130137Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.2130538Z self=, 2025-05-07T20:33:36.2130936Z T=16384, 2025-05-07T20:33:36.2131129Z D=5120, 2025-05-07T20:33:36.2131313Z scale_ub=None, 2025-05-07T20:33:36.2131532Z contiguous=False, 2025-05-07T20:33:36.2131757Z compiled=False, 2025-05-07T20:33:36.2131948Z ) 2025-05-07T20:33:36.2132272Z self = 2025-05-07T20:33:36.2132764Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:36.2133036Z 2025-05-07T20:33:36.2133169Z @given( 2025-05-07T20:33:36.2133393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.2133702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.2134010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.2134335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.2134665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.2134945Z ) 2025-05-07T20:33:36.2135281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.2135735Z def test_silu_mul_quant( 2025-05-07T20:33:36.2135976Z self, 2025-05-07T20:33:36.2136175Z T: int, 2025-05-07T20:33:36.2136377Z D: int, 2025-05-07T20:33:36.2136594Z scale_ub: Optional[float], 2025-05-07T20:33:36.2136874Z contiguous: bool, 2025-05-07T20:33:36.2137106Z compiled: bool, 2025-05-07T20:33:36.2137335Z ) -> None: 2025-05-07T20:33:36.2137552Z torch.manual_seed(2025) 2025-05-07T20:33:36.2137795Z 2025-05-07T20:33:36.2138065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.2138404Z 2025-05-07T20:33:36.2138596Z x_sign = torch.sign(x) 2025-05-07T20:33:36.2138897Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.2141419Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.2143316Z 2025-05-07T20:33:36.2143438Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:36.2143656Z 2025-05-07T20:33:36.2143771Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.2144268Z self=, 2025-05-07T20:33:36.2144685Z T=4096, 2025-05-07T20:33:36.2144883Z D=7168, 2025-05-07T20:33:36.2145075Z scale_ub=1200.0, 2025-05-07T20:33:36.2145314Z contiguous=True, 2025-05-07T20:33:36.2145542Z compiled=True, 2025-05-07T20:33:36.2145744Z ) 2025-05-07T20:33:36.2146058Z self = 2025-05-07T20:33:36.2146614Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:36.2146887Z 2025-05-07T20:33:36.2146971Z @given( 2025-05-07T20:33:36.2147187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.2147572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.2147883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.2148198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.2148522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.2148804Z ) 2025-05-07T20:33:36.2149144Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.2149653Z def test_silu_mul_quant( 2025-05-07T20:33:36.2149902Z self, 2025-05-07T20:33:36.2150085Z T: int, 2025-05-07T20:33:36.2150277Z D: int, 2025-05-07T20:33:36.2150491Z scale_ub: Optional[float], 2025-05-07T20:33:36.2150755Z contiguous: bool, 2025-05-07T20:33:36.2150993Z compiled: bool, 2025-05-07T20:33:36.2151217Z ) -> None: 2025-05-07T20:33:36.2151425Z torch.manual_seed(2025) 2025-05-07T20:33:36.2151655Z 2025-05-07T20:33:36.2151917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.2152257Z 2025-05-07T20:33:36.2152436Z x_sign = torch.sign(x) 2025-05-07T20:33:36.2152720Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.2155209Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
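
The OutOfMemoryError text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but note the numbers it reports: only tens of MiB are reserved-but-unallocated while roughly 21.6 GiB is live, so this looks like genuine exhaustion rather than fragmentation, and the allocator hint may not help much here. If it is tried anyway, it must be in the environment before the first CUDA allocation; a sketch:

    import os

    # Must be set before the process touches CUDA; in CI the cleanest way
    # is exporting it in the job step rather than in Python:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest ...
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the variable is set
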
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.2157081Z 2025-05-07T20:33:36.2157208Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:36.2157414Z 2025-05-07T20:33:36.2157518Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.2157920Z self=, 2025-05-07T20:33:36.2158343Z T=16384, 2025-05-07T20:33:36.2158535Z D=7168, 2025-05-07T20:33:36.2158720Z scale_ub=None, 2025-05-07T20:33:36.2158934Z contiguous=False, 2025-05-07T20:33:36.2159161Z compiled=False, 2025-05-07T20:33:36.2159359Z ) 2025-05-07T20:33:36.2159679Z self = 2025-05-07T20:33:36.2160174Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:36.2160443Z 2025-05-07T20:33:36.2168593Z @given( 2025-05-07T20:33:36.2168860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.2169177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.2169472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.2169792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.2170110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.2170387Z ) 2025-05-07T20:33:36.2170727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.2171165Z def test_silu_mul_quant( 2025-05-07T20:33:36.2171399Z self, 2025-05-07T20:33:36.2171584Z T: int, 2025-05-07T20:33:36.2171778Z D: int, 2025-05-07T20:33:36.2172072Z scale_ub: Optional[float], 2025-05-07T20:33:36.2172338Z contiguous: bool, 2025-05-07T20:33:36.2172580Z compiled: bool, 2025-05-07T20:33:36.2172798Z ) -> None: 2025-05-07T20:33:36.2173004Z torch.manual_seed(2025) 2025-05-07T20:33:36.2173240Z 2025-05-07T20:33:36.2173504Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.2175632Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.2177506Z 2025-05-07T20:33:36.2177627Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.3409572Z 2025-05-07T20:33:36.3409990Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3410590Z self=, 2025-05-07T20:33:36.3411096Z T=2048, 2025-05-07T20:33:36.3411289Z D=7168, 2025-05-07T20:33:36.3411485Z scale_ub=1200.0, 2025-05-07T20:33:36.3411705Z contiguous=True, 2025-05-07T20:33:36.3411923Z compiled=True, 2025-05-07T20:33:36.3412124Z ) 2025-05-07T20:33:36.3412446Z self = 2025-05-07T20:33:36.3412936Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:36.3413200Z 2025-05-07T20:33:36.3413275Z @given( 2025-05-07T20:33:36.3413624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3413933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3414247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3414571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3414897Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3415185Z ) 2025-05-07T20:33:36.3415523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3415968Z def test_silu_mul_quant( 2025-05-07T20:33:36.3416214Z self, 2025-05-07T20:33:36.3416402Z T: int, 2025-05-07T20:33:36.3416598Z D: int, 2025-05-07T20:33:36.3416816Z scale_ub: Optional[float], 2025-05-07T20:33:36.3417081Z contiguous: bool, 2025-05-07T20:33:36.3417319Z compiled: bool, 2025-05-07T20:33:36.3417548Z ) -> None: 2025-05-07T20:33:36.3417755Z torch.manual_seed(2025) 2025-05-07T20:33:36.3418000Z 2025-05-07T20:33:36.3418278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3418623Z 2025-05-07T20:33:36.3418821Z x_sign = torch.sign(x) 2025-05-07T20:33:36.3419121Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.3421263Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
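
The allocation sizes in these failures follow directly from the test's input shape: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes, and the same-sized temporaries from torch.abs/torch.clamp fail identically. Checking the requests reported above:

    # bfloat16 = 2 bytes/element; x has T * (2 * D) elements.
    def x_bytes(T: int, D: int) -> int:
        return T * (2 * D) * 2

    assert x_bytes(16384, 7168) == 448 * 2**20  # "Tried to allocate 448.00 MiB"
    assert x_bytes(16384, 5120) == 320 * 2**20  # "Tried to allocate 320.00 MiB"
    assert x_bytes(4096, 7168) == 112 * 2**20   # "Tried to allocate 112.00 MiB"
    assert x_bytes(2048, 7168) == 56 * 2**20    # "Tried to allocate 56.00 MiB"
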
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.3423205Z 2025-05-07T20:33:36.3423327Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:36.3423540Z 2025-05-07T20:33:36.3423645Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3424071Z self=, 2025-05-07T20:33:36.3424581Z T=2048, 2025-05-07T20:33:36.3424775Z D=7168, 2025-05-07T20:33:36.3424961Z scale_ub=None, 2025-05-07T20:33:36.3425173Z contiguous=True, 2025-05-07T20:33:36.3425395Z compiled=False, 2025-05-07T20:33:36.3425592Z ) 2025-05-07T20:33:36.3425911Z self = 2025-05-07T20:33:36.3426503Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.3426781Z 2025-05-07T20:33:36.3426858Z @given( 2025-05-07T20:33:36.3427084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3427483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3427781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3428110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3428439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3428724Z ) 2025-05-07T20:33:36.3429067Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3429549Z def test_silu_mul_quant( 2025-05-07T20:33:36.3429792Z self, 2025-05-07T20:33:36.3430018Z T: int, 2025-05-07T20:33:36.3430213Z D: int, 2025-05-07T20:33:36.3430420Z scale_ub: Optional[float], 2025-05-07T20:33:36.3430689Z contiguous: bool, 2025-05-07T20:33:36.3430929Z compiled: bool, 2025-05-07T20:33:36.3431143Z ) -> None: 2025-05-07T20:33:36.3431360Z torch.manual_seed(2025) 2025-05-07T20:33:36.3431599Z 2025-05-07T20:33:36.3431861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3432211Z 2025-05-07T20:33:36.3432400Z > x_sign = torch.sign(x) 2025-05-07T20:33:36.3434400Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.3436403Z 2025-05-07T20:33:36.3436521Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:36.3436732Z 2025-05-07T20:33:36.3436834Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3437241Z self=, 2025-05-07T20:33:36.3437634Z T=1, 2025-05-07T20:33:36.3437811Z D=7168, 2025-05-07T20:33:36.3437997Z scale_ub=1200.0, 2025-05-07T20:33:36.3438215Z contiguous=True, 2025-05-07T20:33:36.3438430Z compiled=False, 2025-05-07T20:33:36.3438634Z ) 2025-05-07T20:33:36.3438946Z self = 2025-05-07T20:33:36.3439422Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.3439688Z 2025-05-07T20:33:36.3439768Z @given( 2025-05-07T20:33:36.3439997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3440595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3440899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3441229Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3441551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3441832Z ) 2025-05-07T20:33:36.3442175Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3442627Z def test_silu_mul_quant( 2025-05-07T20:33:36.3442869Z self, 2025-05-07T20:33:36.3443064Z T: int, 2025-05-07T20:33:36.3443266Z D: int, 2025-05-07T20:33:36.3443480Z scale_ub: Optional[float], 2025-05-07T20:33:36.3443755Z contiguous: bool, 2025-05-07T20:33:36.3444074Z compiled: bool, 2025-05-07T20:33:36.3444293Z ) -> None: 2025-05-07T20:33:36.3444519Z torch.manual_seed(2025) 2025-05-07T20:33:36.3444757Z 2025-05-07T20:33:36.3445019Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3445362Z 2025-05-07T20:33:36.3445556Z x_sign = torch.sign(x) 2025-05-07T20:33:36.3445915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.3446213Z x = x_sign * x_clamp 2025-05-07T20:33:36.3446460Z x0 = x[:, :D] 2025-05-07T20:33:36.3446671Z x1 = x[:, D:] 2025-05-07T20:33:36.3446871Z 2025-05-07T20:33:36.3447053Z if contiguous: 2025-05-07T20:33:36.3447282Z x0 = x0.contiguous() 2025-05-07T20:33:36.3447531Z x1 = x1.contiguous() 2025-05-07T20:33:36.3447771Z 2025-05-07T20:33:36.3447957Z if scale_ub is not None: 2025-05-07T20:33:36.3448225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.3448559Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.3448869Z ) 2025-05-07T20:33:36.3449124Z else: 2025-05-07T20:33:36.3449335Z scale_ub_tensor = None 2025-05-07T20:33:36.3449588Z 2025-05-07T20:33:36.3449810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.3450141Z op = silu_mul_quant 2025-05-07T20:33:36.3450440Z if compiled: 2025-05-07T20:33:36.3450686Z op = torch.compile(op) 2025-05-07T20:33:36.3450982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3451257Z 2025-05-07T20:33:36.3451453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.3451614Z 2025-05-07T20:33:36.3451712Z moe/activation_test.py:117: 2025-05-07T20:33:36.3452005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3452399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.3452680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3453379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.3454068Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.3454611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.3455292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.3455954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.3456490Z kernel = self.compile( 2025-05-07T20:33:36.3457053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.3457729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.3458138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3458365Z 2025-05-07T20:33:36.3458580Z self = 2025-05-07T20:33:36.3459676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.3461039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199aa2a0>} 2025-05-07T20:33:36.3462365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.3463383Z context = 2025-05-07T20:33:36.3463674Z 2025-05-07T20:33:36.3463894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.3464418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.3464888Z module_map=module_map) 2025-05-07T20:33:36.3465259Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.3465604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.3465918Z E ^ 2025-05-07T20:33:36.3466379Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.3466831Z 2025-05-07T20:33:36.3467262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.3467837Z 2025-05-07T20:33:36.3467939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3468345Z self=, 2025-05-07T20:33:36.3468738Z T=128, 2025-05-07T20:33:36.3468921Z D=5120, 2025-05-07T20:33:36.3469110Z scale_ub=None, 2025-05-07T20:33:36.3469395Z contiguous=True, 2025-05-07T20:33:36.3469614Z compiled=False, 2025-05-07T20:33:36.3469814Z ) 2025-05-07T20:33:36.3470133Z self = 2025-05-07T20:33:36.3470614Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.3470887Z 2025-05-07T20:33:36.3470963Z @given( 2025-05-07T20:33:36.3471210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3471516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3471817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3472141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3472470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3472793Z ) 2025-05-07T20:33:36.3473147Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3473608Z def test_silu_mul_quant( 2025-05-07T20:33:36.3473846Z self, 2025-05-07T20:33:36.3474037Z T: int, 2025-05-07T20:33:36.3474234Z D: int, 2025-05-07T20:33:36.3474446Z scale_ub: Optional[float], 2025-05-07T20:33:36.3474724Z contiguous: bool, 2025-05-07T20:33:36.3474964Z compiled: bool, 2025-05-07T20:33:36.3475179Z ) -> None: 2025-05-07T20:33:36.3475393Z torch.manual_seed(2025) 2025-05-07T20:33:36.3475635Z 2025-05-07T20:33:36.3475896Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3476231Z 2025-05-07T20:33:36.3476431Z x_sign = torch.sign(x) 2025-05-07T20:33:36.3476716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.3477013Z x = x_sign * x_clamp 2025-05-07T20:33:36.3477256Z x0 = x[:, :D] 2025-05-07T20:33:36.3477471Z x1 = x[:, D:] 2025-05-07T20:33:36.3477671Z 2025-05-07T20:33:36.3477856Z if contiguous: 2025-05-07T20:33:36.3478084Z x0 = x0.contiguous() 2025-05-07T20:33:36.3478331Z x1 = x1.contiguous() 2025-05-07T20:33:36.3478563Z 2025-05-07T20:33:36.3478749Z if scale_ub is not None: 2025-05-07T20:33:36.3479014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.3479346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.3479655Z ) 2025-05-07T20:33:36.3479841Z else: 2025-05-07T20:33:36.3480050Z scale_ub_tensor = None 2025-05-07T20:33:36.3480298Z 2025-05-07T20:33:36.3480518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.3480824Z op = silu_mul_quant 2025-05-07T20:33:36.3481075Z if compiled: 2025-05-07T20:33:36.3481321Z op = torch.compile(op) 2025-05-07T20:33:36.3481613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3481879Z 2025-05-07T20:33:36.3482073Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.3482283Z 2025-05-07T20:33:36.3482382Z moe/activation_test.py:117: 2025-05-07T20:33:36.3482676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3483003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.3483274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3483958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.3484686Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.3485235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.3485903Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.3486563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.3487087Z kernel = self.compile( 2025-05-07T20:33:36.3487671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.3488319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.3488716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3488943Z 2025-05-07T20:33:36.3489155Z self = 2025-05-07T20:33:36.3490311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.3491665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199ab1a0>} 2025-05-07T20:33:36.3493037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.3494044Z context = 2025-05-07T20:33:36.3494324Z 2025-05-07T20:33:36.3494494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.3495012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.3495491Z module_map=module_map) 2025-05-07T20:33:36.3495855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.3496200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.3496457Z E ^ 2025-05-07T20:33:36.3496923Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.3497375Z 2025-05-07T20:33:36.3497799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.4632412Z 2025-05-07T20:33:36.4633018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.4633628Z self=, 2025-05-07T20:33:36.4634168Z T=128, 2025-05-07T20:33:36.4634382Z D=7168, 2025-05-07T20:33:36.4634580Z scale_ub=None, 2025-05-07T20:33:36.4634784Z contiguous=True, 2025-05-07T20:33:36.4635003Z compiled=False, 2025-05-07T20:33:36.4635206Z ) 2025-05-07T20:33:36.4635522Z self = 2025-05-07T20:33:36.4636002Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.4636272Z 2025-05-07T20:33:36.4636349Z @given( 2025-05-07T20:33:36.4636580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.4636878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.4637486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.4637814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.4638131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.4638406Z ) 2025-05-07T20:33:36.4638745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.4639315Z def test_silu_mul_quant( 2025-05-07T20:33:36.4639553Z self, 2025-05-07T20:33:36.4639734Z T: int, 2025-05-07T20:33:36.4639930Z D: int, 2025-05-07T20:33:36.4640409Z scale_ub: Optional[float], 2025-05-07T20:33:36.4640795Z contiguous: bool, 2025-05-07T20:33:36.4641033Z compiled: bool, 2025-05-07T20:33:36.4641262Z ) -> None: 2025-05-07T20:33:36.4641471Z torch.manual_seed(2025) 2025-05-07T20:33:36.4641713Z 2025-05-07T20:33:36.4641987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.4642319Z 2025-05-07T20:33:36.4642516Z x_sign = torch.sign(x) 2025-05-07T20:33:36.4642905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.4643212Z x = x_sign * x_clamp 2025-05-07T20:33:36.4643449Z x0 = x[:, :D] 2025-05-07T20:33:36.4643671Z x1 = x[:, D:] 2025-05-07T20:33:36.4643873Z 2025-05-07T20:33:36.4644050Z if contiguous: 2025-05-07T20:33:36.4644280Z x0 = x0.contiguous() 2025-05-07T20:33:36.4644534Z x1 = x1.contiguous() 2025-05-07T20:33:36.4644765Z 2025-05-07T20:33:36.4644953Z if scale_ub is not None: 2025-05-07T20:33:36.4645226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.4645548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.4645857Z ) 2025-05-07T20:33:36.4646044Z else: 2025-05-07T20:33:36.4646340Z scale_ub_tensor = None 2025-05-07T20:33:36.4646587Z 2025-05-07T20:33:36.4646811Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.4647115Z op = silu_mul_quant 2025-05-07T20:33:36.4647357Z if compiled: 2025-05-07T20:33:36.4647602Z op = torch.compile(op) 2025-05-07T20:33:36.4647890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4648163Z 2025-05-07T20:33:36.4648353Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.4648515Z 2025-05-07T20:33:36.4648621Z moe/activation_test.py:117: 2025-05-07T20:33:36.4648909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4649230Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.4649508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4650208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.4650888Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.4651439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.4652117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.4652762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.4653292Z kernel = self.compile( 2025-05-07T20:33:36.4653852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.4654493Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.4654885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4655119Z 2025-05-07T20:33:36.4655321Z self = 2025-05-07T20:33:36.4656507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.4657883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b78040>} 2025-05-07T20:33:36.4659202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.4660345Z context = 2025-05-07T20:33:36.4660624Z 2025-05-07T20:33:36.4660795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.4661309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.4661766Z module_map=module_map) 2025-05-07T20:33:36.4662133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.4662492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.4662788Z E ^ 2025-05-07T20:33:36.4663254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.4663714Z 2025-05-07T20:33:36.4664150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.4664651Z 2025-05-07T20:33:36.4664758Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.4665156Z self=, 2025-05-07T20:33:36.4665553Z T=2048, 2025-05-07T20:33:36.4665740Z D=7168, 2025-05-07T20:33:36.4665922Z scale_ub=1200.0, 2025-05-07T20:33:36.4666142Z contiguous=True, 2025-05-07T20:33:36.4666410Z compiled=False, 2025-05-07T20:33:36.4666610Z ) 2025-05-07T20:33:36.4666929Z self = 2025-05-07T20:33:36.4667500Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.4667772Z 2025-05-07T20:33:36.4667853Z @given( 2025-05-07T20:33:36.4668069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.4668374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.4668675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.4668996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.4669321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.4669601Z ) 2025-05-07T20:33:36.4669945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.4670373Z def test_silu_mul_quant( 2025-05-07T20:33:36.4670612Z self, 2025-05-07T20:33:36.4670795Z T: int, 2025-05-07T20:33:36.4670996Z D: int, 2025-05-07T20:33:36.4671208Z scale_ub: Optional[float], 2025-05-07T20:33:36.4671475Z contiguous: bool, 2025-05-07T20:33:36.4671709Z compiled: bool, 2025-05-07T20:33:36.4671932Z ) -> None: 2025-05-07T20:33:36.4672146Z torch.manual_seed(2025) 2025-05-07T20:33:36.4672377Z 2025-05-07T20:33:36.4672640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.4674664Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
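
Each "Trying example:" block is Hypothesis replaying the property test with one draw from the sampled_from grids, and the op under test is optionally wrapped in torch.compile inside fn(). That is why the Triton CompilationError only surfaces at the y_fp8, y_scale = fn() call: Triton JIT-compiles a kernel on its first launch, whether reached eagerly or through torch.compile. The skeleton of that pattern, reduced to its moving parts (the relu stand-in is illustrative only):

    from hypothesis import Verbosity, given, settings, strategies as st
    import torch

    @given(
        T=st.sampled_from([1, 128]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=4, deadline=None)
    def test_pattern(T: int, compiled: bool) -> None:
        x = torch.randn(T, 8)
        op = torch.nn.functional.relu  # stand-in for silu_mul_quant
        if compiled:
            op = torch.compile(op)     # compiled lazily, at the first call
        assert op(x).shape == x.shape
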
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.4676608Z 2025-05-07T20:33:36.4676727Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.4676933Z 2025-05-07T20:33:36.4677082Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.4677493Z self=, 2025-05-07T20:33:36.4677894Z T=1, 2025-05-07T20:33:36.4678074Z D=5120, 2025-05-07T20:33:36.4678254Z scale_ub=1200.0, 2025-05-07T20:33:36.4678473Z contiguous=True, 2025-05-07T20:33:36.4678704Z compiled=False, 2025-05-07T20:33:36.4678949Z ) 2025-05-07T20:33:36.4679269Z self = 2025-05-07T20:33:36.4679748Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.4680006Z 2025-05-07T20:33:36.4680084Z @given( 2025-05-07T20:33:36.4680310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.4680614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.4680911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.4689707Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.4690059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.4690460Z ) 2025-05-07T20:33:36.4690807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.4691261Z def test_silu_mul_quant( 2025-05-07T20:33:36.4691505Z self, 2025-05-07T20:33:36.4691699Z T: int, 2025-05-07T20:33:36.4691902Z D: int, 2025-05-07T20:33:36.4692118Z scale_ub: Optional[float], 2025-05-07T20:33:36.4692381Z contiguous: bool, 2025-05-07T20:33:36.4692631Z compiled: bool, 2025-05-07T20:33:36.4692857Z ) -> None: 2025-05-07T20:33:36.4693074Z torch.manual_seed(2025) 2025-05-07T20:33:36.4693309Z 2025-05-07T20:33:36.4693580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.4693973Z 2025-05-07T20:33:36.4694160Z x_sign = torch.sign(x) 2025-05-07T20:33:36.4694452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.4694757Z x = x_sign * x_clamp 2025-05-07T20:33:36.4694986Z x0 = x[:, :D] 2025-05-07T20:33:36.4695193Z x1 = x[:, D:] 2025-05-07T20:33:36.4695400Z 2025-05-07T20:33:36.4695574Z if contiguous: 2025-05-07T20:33:36.4695804Z x0 = x0.contiguous() 2025-05-07T20:33:36.4696060Z x1 = x1.contiguous() 2025-05-07T20:33:36.4696293Z 2025-05-07T20:33:36.4696481Z if scale_ub is not None: 2025-05-07T20:33:36.4696753Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.4697076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.4697379Z ) 2025-05-07T20:33:36.4697569Z else: 2025-05-07T20:33:36.4697777Z scale_ub_tensor = None 2025-05-07T20:33:36.4698022Z 2025-05-07T20:33:36.4698249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.4698563Z op = silu_mul_quant 2025-05-07T20:33:36.4698801Z if compiled: 2025-05-07T20:33:36.4699053Z op = torch.compile(op) 2025-05-07T20:33:36.4699351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4699615Z 2025-05-07T20:33:36.4699800Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.4699964Z 2025-05-07T20:33:36.4700068Z moe/activation_test.py:117: 2025-05-07T20:33:36.4700360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4700690Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.4700974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4701677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.4702356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.4702917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.4703592Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.4704303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.4704837Z kernel = self.compile( 2025-05-07T20:33:36.4705387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.4706060Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.4706500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4706729Z 2025-05-07T20:33:36.4706934Z self = 2025-05-07T20:33:36.4708067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.4709478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b79580>} 2025-05-07T20:33:36.4710801Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.4711816Z context = 2025-05-07T20:33:36.4712108Z 2025-05-07T20:33:36.4712272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.4712794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.4713262Z module_map=module_map) 2025-05-07T20:33:36.4713671Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.4714029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.4714288Z E ^ 2025-05-07T20:33:36.4714762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.4715218Z 2025-05-07T20:33:36.4715638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.5563093Z 2025-05-07T20:33:36.5563394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5564030Z self=, 2025-05-07T20:33:36.5564431Z T=2048, 2025-05-07T20:33:36.5564623Z D=5120, 2025-05-07T20:33:36.5564814Z scale_ub=None, 2025-05-07T20:33:36.5565025Z contiguous=True, 2025-05-07T20:33:36.5565251Z compiled=False, 2025-05-07T20:33:36.5565465Z ) 2025-05-07T20:33:36.5565792Z self = 2025-05-07T20:33:36.5566293Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.5566574Z 2025-05-07T20:33:36.5566657Z @given( 2025-05-07T20:33:36.5566892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5567197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5567535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5567867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5568195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5568477Z ) 2025-05-07T20:33:36.5568818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5569278Z def test_silu_mul_quant( 2025-05-07T20:33:36.5569519Z self, 2025-05-07T20:33:36.5569709Z T: int, 2025-05-07T20:33:36.5569908Z D: int, 2025-05-07T20:33:36.5570130Z scale_ub: Optional[float], 2025-05-07T20:33:36.5570399Z contiguous: bool, 2025-05-07T20:33:36.5570640Z compiled: bool, 2025-05-07T20:33:36.5570864Z ) -> None: 2025-05-07T20:33:36.5571297Z torch.manual_seed(2025) 2025-05-07T20:33:36.5571540Z 2025-05-07T20:33:36.5571811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5572150Z 2025-05-07T20:33:36.5572339Z > x_sign = torch.sign(x) 2025-05-07T20:33:36.5574268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5576283Z 2025-05-07T20:33:36.5576399Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:36.5576608Z 2025-05-07T20:33:36.5576721Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5577203Z self=, 2025-05-07T20:33:36.5577619Z T=16384, 2025-05-07T20:33:36.5577815Z D=5120, 2025-05-07T20:33:36.5577995Z scale_ub=None, 2025-05-07T20:33:36.5578210Z contiguous=True, 2025-05-07T20:33:36.5578433Z compiled=False, 2025-05-07T20:33:36.5578632Z ) 2025-05-07T20:33:36.5578949Z self = 2025-05-07T20:33:36.5579455Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.5579729Z 2025-05-07T20:33:36.5579813Z @given( 2025-05-07T20:33:36.5580034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5580354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5580748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5581066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5581393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5581675Z ) 2025-05-07T20:33:36.5582016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5582462Z def test_silu_mul_quant( 2025-05-07T20:33:36.5582704Z self, 2025-05-07T20:33:36.5582890Z T: int, 2025-05-07T20:33:36.5583087Z D: int, 2025-05-07T20:33:36.5583311Z scale_ub: Optional[float], 2025-05-07T20:33:36.5583580Z contiguous: bool, 2025-05-07T20:33:36.5583815Z compiled: bool, 2025-05-07T20:33:36.5584037Z ) -> None: 2025-05-07T20:33:36.5584252Z torch.manual_seed(2025) 2025-05-07T20:33:36.5584489Z 2025-05-07T20:33:36.5584757Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5586822Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5588837Z 2025-05-07T20:33:36.5588962Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5589170Z 2025-05-07T20:33:36.5589276Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5589686Z self=, 2025-05-07T20:33:36.5590081Z T=4096, 2025-05-07T20:33:36.5590271Z D=5120, 2025-05-07T20:33:36.5590453Z scale_ub=None, 2025-05-07T20:33:36.5590671Z contiguous=True, 2025-05-07T20:33:36.5590900Z compiled=False, 2025-05-07T20:33:36.5591096Z ) 2025-05-07T20:33:36.5591461Z self = 2025-05-07T20:33:36.5591949Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.5592212Z 2025-05-07T20:33:36.5592291Z @given( 2025-05-07T20:33:36.5592514Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5592818Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5593180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5593509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5593834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5594116Z ) 2025-05-07T20:33:36.5594460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5594909Z def test_silu_mul_quant( 2025-05-07T20:33:36.5595151Z self, 2025-05-07T20:33:36.5595338Z T: int, 2025-05-07T20:33:36.5595534Z D: int, 2025-05-07T20:33:36.5595763Z scale_ub: Optional[float], 2025-05-07T20:33:36.5596028Z contiguous: bool, 2025-05-07T20:33:36.5596270Z compiled: bool, 2025-05-07T20:33:36.5596541Z ) -> None: 2025-05-07T20:33:36.5596754Z torch.manual_seed(2025) 2025-05-07T20:33:36.5597000Z 2025-05-07T20:33:36.5597275Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5599322Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5601326Z 2025-05-07T20:33:36.5601440Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5601657Z 2025-05-07T20:33:36.5601761Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5602185Z self=, 2025-05-07T20:33:36.5602578Z T=2048, 2025-05-07T20:33:36.5602761Z D=5120, 2025-05-07T20:33:36.5602954Z scale_ub=None, 2025-05-07T20:33:36.5603170Z contiguous=False, 2025-05-07T20:33:36.5603396Z compiled=False, 2025-05-07T20:33:36.5603611Z ) 2025-05-07T20:33:36.5603938Z self = 2025-05-07T20:33:36.5604435Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:36.5604716Z 2025-05-07T20:33:36.5604796Z @given( 2025-05-07T20:33:36.5605023Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5605339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5605635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5605962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5606291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5606565Z ) 2025-05-07T20:33:36.5606910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5607349Z def test_silu_mul_quant( 2025-05-07T20:33:36.5607587Z self, 2025-05-07T20:33:36.5607780Z T: int, 2025-05-07T20:33:36.5607986Z D: int, 2025-05-07T20:33:36.5608198Z scale_ub: Optional[float], 2025-05-07T20:33:36.5608473Z contiguous: bool, 2025-05-07T20:33:36.5608712Z compiled: bool, 2025-05-07T20:33:36.5608931Z ) -> None: 2025-05-07T20:33:36.5609149Z torch.manual_seed(2025) 2025-05-07T20:33:36.5609387Z 2025-05-07T20:33:36.5609646Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5611877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
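
Reading the OOM reports in sequence, PyTorch's live allocation hovers around 21.5-21.6 GiB in the earlier examples and reaches 21.73 GiB by these later ones, so each Hypothesis draw starts with less headroom than the last; references to earlier examples' tensors are apparently still alive. One mitigation, assuming the growth comes from such lingering references and cached blocks, is to release memory between examples:

    import gc
    import torch

    def release_cuda() -> None:
        # Drop dead Python references first, then return cached, unused
        # blocks to the driver so the next example allocates from a cleaner
        # pool. This cannot reclaim tensors that are still referenced.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. call release_cuda() from the TestCase's setUp()/tearDown().
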
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5613866Z 2025-05-07T20:33:36.5613980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5614194Z 2025-05-07T20:33:36.5614293Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5614705Z self=, 2025-05-07T20:33:36.5615109Z T=4096, 2025-05-07T20:33:36.5615298Z D=7168, 2025-05-07T20:33:36.5615485Z scale_ub=None, 2025-05-07T20:33:36.5615693Z contiguous=True, 2025-05-07T20:33:36.5615922Z compiled=True, 2025-05-07T20:33:36.5616126Z ) 2025-05-07T20:33:36.5616449Z self = 2025-05-07T20:33:36.5616985Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:36.5617263Z 2025-05-07T20:33:36.5617345Z @given( 2025-05-07T20:33:36.5617572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5617882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5618200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5618531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5618861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5619150Z ) 2025-05-07T20:33:36.5619500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5619978Z def test_silu_mul_quant( 2025-05-07T20:33:36.5620220Z self, 2025-05-07T20:33:36.5620413Z T: int, 2025-05-07T20:33:36.5620619Z D: int, 2025-05-07T20:33:36.5620840Z scale_ub: Optional[float], 2025-05-07T20:33:36.5621114Z contiguous: bool, 2025-05-07T20:33:36.5621363Z compiled: bool, 2025-05-07T20:33:36.5621584Z ) -> None: 2025-05-07T20:33:36.5621807Z torch.manual_seed(2025) 2025-05-07T20:33:36.5622051Z 2025-05-07T20:33:36.5622316Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5624349Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5626222Z 2025-05-07T20:33:36.5626341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5626555Z 2025-05-07T20:33:36.5626658Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5627066Z self=, 2025-05-07T20:33:36.5627529Z T=2048, 2025-05-07T20:33:36.5627715Z D=5120, 2025-05-07T20:33:36.5627905Z scale_ub=1200.0, 2025-05-07T20:33:36.5628120Z contiguous=False, 2025-05-07T20:33:36.5628348Z compiled=False, 2025-05-07T20:33:36.6184305Z ) 2025-05-07T20:33:36.6185272Z self = 2025-05-07T20:33:36.6186311Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:36.6186854Z 2025-05-07T20:33:36.6187008Z @given( 2025-05-07T20:33:36.6187575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6188176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6189053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6189697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6190180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6190453Z ) 2025-05-07T20:33:36.6190789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6191221Z def test_silu_mul_quant( 2025-05-07T20:33:36.6191537Z self, 2025-05-07T20:33:36.6191724Z T: int, 2025-05-07T20:33:36.6191912Z D: int, 2025-05-07T20:33:36.6192128Z scale_ub: Optional[float], 2025-05-07T20:33:36.6192394Z contiguous: bool, 2025-05-07T20:33:36.6192635Z compiled: bool, 2025-05-07T20:33:36.6192860Z ) -> None: 2025-05-07T20:33:36.6193070Z torch.manual_seed(2025) 2025-05-07T20:33:36.6193306Z 2025-05-07T20:33:36.6193579Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6195828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.6197793Z 2025-05-07T20:33:36.6197917Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.6198124Z 2025-05-07T20:33:36.6198232Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.6198634Z self=, 2025-05-07T20:33:36.6199114Z T=4096, 2025-05-07T20:33:36.6199327Z D=7168, 2025-05-07T20:33:36.6199521Z scale_ub=1200.0, 2025-05-07T20:33:36.6199745Z contiguous=True, 2025-05-07T20:33:36.6199960Z compiled=False, 2025-05-07T20:33:36.6200165Z ) 2025-05-07T20:33:36.6200483Z self = 2025-05-07T20:33:36.6200962Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.6201231Z 2025-05-07T20:33:36.6201302Z @given( 2025-05-07T20:33:36.6201522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6201820Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6202119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6202442Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6202755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6203035Z ) 2025-05-07T20:33:36.6203390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6203837Z def test_silu_mul_quant( 2025-05-07T20:33:36.6204068Z self, 2025-05-07T20:33:36.6204263Z T: int, 2025-05-07T20:33:36.6204453Z D: int, 2025-05-07T20:33:36.6204661Z scale_ub: Optional[float], 2025-05-07T20:33:36.6204928Z contiguous: bool, 2025-05-07T20:33:36.6205168Z compiled: bool, 2025-05-07T20:33:36.6205388Z ) -> None: 2025-05-07T20:33:36.6205597Z torch.manual_seed(2025) 2025-05-07T20:33:36.6205830Z 2025-05-07T20:33:36.6206091Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6208225Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.6210102Z 2025-05-07T20:33:36.6210220Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.6210436Z 2025-05-07T20:33:36.6210538Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.6210942Z self=, 2025-05-07T20:33:36.6211372Z T=16384, 2025-05-07T20:33:36.6211558Z D=7168, 2025-05-07T20:33:36.6211746Z scale_ub=None, 2025-05-07T20:33:36.6211948Z contiguous=False, 2025-05-07T20:33:36.6212167Z compiled=True, 2025-05-07T20:33:36.6212364Z ) 2025-05-07T20:33:36.6212672Z self = 2025-05-07T20:33:36.6213166Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:36.6213438Z 2025-05-07T20:33:36.6213519Z @given( 2025-05-07T20:33:36.6213742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6214056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6214409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6214732Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6215045Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6215329Z ) 2025-05-07T20:33:36.6215682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6216125Z def test_silu_mul_quant( 2025-05-07T20:33:36.6216364Z self, 2025-05-07T20:33:36.6216557Z T: int, 2025-05-07T20:33:36.6216742Z D: int, 2025-05-07T20:33:36.6216964Z scale_ub: Optional[float], 2025-05-07T20:33:36.6217233Z contiguous: bool, 2025-05-07T20:33:36.6217464Z compiled: bool, 2025-05-07T20:33:36.6217681Z ) -> None: 2025-05-07T20:33:36.6217939Z torch.manual_seed(2025) 2025-05-07T20:33:36.6218174Z 2025-05-07T20:33:36.6218432Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6220446Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.6222303Z 2025-05-07T20:33:36.6222416Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.6222620Z 2025-05-07T20:33:36.6222725Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.6223129Z self=, 2025-05-07T20:33:36.6223536Z T=4096, 2025-05-07T20:33:36.6223717Z D=7168, 2025-05-07T20:33:36.6223902Z scale_ub=None, 2025-05-07T20:33:36.6224108Z contiguous=True, 2025-05-07T20:33:36.6224332Z compiled=False, 2025-05-07T20:33:36.6224530Z ) 2025-05-07T20:33:36.6224839Z self = 2025-05-07T20:33:36.6225343Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.6225609Z 2025-05-07T20:33:36.6225692Z @given( 2025-05-07T20:33:36.6225909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6226212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6226511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6226821Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6227143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6227492Z ) 2025-05-07T20:33:36.6227832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6228280Z def test_silu_mul_quant( 2025-05-07T20:33:36.6228564Z self, 2025-05-07T20:33:36.6228755Z T: int, 2025-05-07T20:33:36.6228940Z D: int, 2025-05-07T20:33:36.6229153Z scale_ub: Optional[float], 2025-05-07T20:33:36.6229435Z contiguous: bool, 2025-05-07T20:33:36.6229665Z compiled: bool, 2025-05-07T20:33:36.6229887Z ) -> None: 2025-05-07T20:33:36.6230165Z torch.manual_seed(2025) 2025-05-07T20:33:36.6230423Z 2025-05-07T20:33:36.6230693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6232733Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (remainder of the message identical to the one above)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
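This CompilationError, which recurs below for both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row, is a hardware capability gap rather than a kernel bug: Triton's fp8e4nv corresponds to float8_e4m3fn, which the NVIDIA backend generally accepts only on compute capability 8.9 or newer (Ada/Hopper), and the GPU on this runner evidently predates that. A minimal skip guard, as a sketch (the 8.9 threshold and the decorator placement are assumptions inferred from the error text, not FBGEMM's actual test gating):

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) needs SM 8.9+ on NVIDIA
        # GPUs; older parts only expose fp8e4b15 / fp8e5, as the ValueError says.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _cuda_supports_fp8e4nv(), "FP8 e4m3 kernels need SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would be skipped cleanly on this runner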
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
(test source identical to the listing above; with compiled=True the call enters through
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
and then fails in the same Triton compile step:)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
(test source identical to the listing above)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
(test source identical to the listing above)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free, 3.87 MiB reserved but unallocated; message otherwise identical to the one above)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free, 3.87 MiB reserved but unallocated; message otherwise identical to the one above)

moe/activation_test.py:92: OutOfMemoryError
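Note the allocator's trajectory across these examples: free memory drops from 26.44 MiB to 4.44 MiB and never recovers, so even 20.00 MiB requests fail once the 448.00 MiB examples have run. Two common mitigations, sketched under the assumption that the cleanup hook is ours to add (neither appears in the test file as shown):

    import gc
    import os

    # 1) The allocator hint the error message itself suggests; it must be set
    #    before the first CUDA allocation, e.g. in the CI job's environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # 2) Between Hypothesis examples: drop dead tensors still pinned by
        #    traceback frames, then return cached blocks to the driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Calling release_cuda_memory() from the test's setUp() would keep one oversized example (say T=16384, D=7168) from starving every example generated after it.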
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
    | See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (message otherwise identical to sub-exception 1)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (message otherwise identical to sub-exception 1)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
  +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |         a,
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(*args, **current)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(src, target=target, options=options.__dict__)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
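Each sub-exception above ends with a reproduce_failure hint. That decorator makes Hypothesis replay the recorded choice sequence instead of generating fresh examples, so the failing parameters come back deterministically. A sketch of how sub-exception 1 would be pinned down locally (a trimmed reconstruction of the test, not the file as committed):

    import unittest
    from typing import Optional

    import torch
    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    class ReproActivationTests(unittest.TestCase):
        # Version and blob are copied verbatim from sub-exception 1; Hypothesis
        # refuses to replay a blob recorded under a different library version.
        @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, deadline=None)
        def test_silu_mul_quant_repro(
            self, T: int, D: int, scale_ub: Optional[float],
            contiguous: bool, compiled: bool,
        ) -> None:
            # Same first allocation as the real test; it only OOMs if the GPU
            # is in the same exhausted state as on the runner.
            x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
            del x

As the log says, the decorator is meant to be temporary: once the underlying failure is fixed, it comes off again.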
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
(test source identical to the listing above, continuing past the fn() call:)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(jit, autotuner, and do_bench frames identical to sub-exception 4 above)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
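This example shows the reference path dying the same way as the kernel under test: triton_quantize_fp8_row also JIT-compiles an fp8e4nv kernel, so on this GPU even the "reference" cannot run. For what ref_fn is computing, a pure-PyTorch row-wise FP8 quantization looks roughly like this (a sketch of the general recipe with names of my own; it is not FBGEMM's implementation and ignores that library's exact rounding and epsilon choices):

    import torch

    # E4M3 max value; equals torch.finfo(torch.float8_e4m3fn).max.
    E4M3_MAX = 448.0

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub_tensor=None):
        # Per-row scale so the largest magnitude in each row maps to E4M3_MAX.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub_tensor is not None:
            row_max = torch.minimum(row_max, scale_ub_tensor.to(row_max.device))
        scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

A fallback along these lines also runs on CPU, which is one way to keep the reference comparison meaningful on GPUs without FP8 support.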
2025-05-07T20:33:37.1118208Z op = torch.compile(op) 2025-05-07T20:33:37.1118606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1118979Z 2025-05-07T20:33:37.1119229Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1119443Z 2025-05-07T20:33:37.1119585Z moe/activation_test.py:117: 2025-05-07T20:33:37.1119979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1120433Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1120803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1121788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1122688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1123385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1124266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1125126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1125817Z kernel = self.compile( 2025-05-07T20:33:37.1126515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1127358Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1127871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1128179Z 2025-05-07T20:33:37.1128438Z self = 2025-05-07T20:33:37.1129840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1131698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f5bf2020>} 2025-05-07T20:33:37.1133462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1134805Z context = 2025-05-07T20:33:37.1135180Z 2025-05-07T20:33:37.1135400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1136074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1136738Z module_map=module_map) 2025-05-07T20:33:37.1137213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1137661Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1137984Z E ^ 2025-05-07T20:33:37.1138584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1139274Z 2025-05-07T20:33:37.1139834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1140770Z 2025-05-07T20:33:37.1140920Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1141474Z self=, 2025-05-07T20:33:37.1142011Z T=2048, 2025-05-07T20:33:37.1142272Z D=5120, 2025-05-07T20:33:37.1142537Z scale_ub=1200.0, 2025-05-07T20:33:37.1173251Z contiguous=True, 2025-05-07T20:33:37.1173524Z compiled=True, 2025-05-07T20:33:37.1173751Z ) 2025-05-07T20:33:37.1174396Z self = 2025-05-07T20:33:37.1174898Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1175179Z 2025-05-07T20:33:37.1175268Z @given( 2025-05-07T20:33:37.1175492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1175816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1176124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1176444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1176765Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1177041Z ) 2025-05-07T20:33:37.1177387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1177920Z def test_silu_mul_quant( 2025-05-07T20:33:37.1178252Z self, 2025-05-07T20:33:37.1178493Z T: int, 2025-05-07T20:33:37.1178818Z D: int, 2025-05-07T20:33:37.1179333Z scale_ub: Optional[float], 2025-05-07T20:33:37.1179658Z contiguous: bool, 2025-05-07T20:33:37.1179988Z compiled: bool, 2025-05-07T20:33:37.1180385Z ) -> None: 2025-05-07T20:33:37.1194734Z torch.manual_seed(2025) 2025-05-07T20:33:37.1195002Z 2025-05-07T20:33:37.1195276Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1195641Z 2025-05-07T20:33:37.1195832Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1196113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1196425Z x = x_sign * x_clamp 2025-05-07T20:33:37.1196665Z x0 = x[:, :D] 2025-05-07T20:33:37.1196873Z x1 = x[:, D:] 2025-05-07T20:33:37.1197076Z 2025-05-07T20:33:37.1197260Z if contiguous: 2025-05-07T20:33:37.1197491Z x0 = x0.contiguous() 2025-05-07T20:33:37.1197738Z x1 = x1.contiguous() 2025-05-07T20:33:37.1197976Z 2025-05-07T20:33:37.1198170Z if scale_ub is not None: 2025-05-07T20:33:37.1198434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1198766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1199076Z ) 2025-05-07T20:33:37.1199260Z else: 2025-05-07T20:33:37.1199464Z scale_ub_tensor = None 2025-05-07T20:33:37.1199714Z 2025-05-07T20:33:37.1199933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1200244Z op = silu_mul_quant 2025-05-07T20:33:37.1200486Z if compiled: 2025-05-07T20:33:37.1200724Z op = torch.compile(op) 2025-05-07T20:33:37.1201016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1201288Z 2025-05-07T20:33:37.1201470Z y_fp8, y_scale = fn() 2025-05-07T20:33:37.1201753Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:37.1202036Z 2025-05-07T20:33:37.1202390Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1202716Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:37.1203003Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:37.1203312Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:37.1203657Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:37.1204052Z 2025-05-07T20:33:37.1204251Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(same jit.py:330 -> jit.py:623 -> compiler.py:273 frames as above, here with num_stages=3)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
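Both kernels die at the same point: Triton refuses to lower the fp8e4nv (float8_e4m3fn) element type on this runner's GPU (a g5.4xlarge carries an A10G, compute capability 8.6), so every FP8 code path in the test fails before launch. A minimal sketch of a capability gate that would skip these tests on such hardware follows; the (8, 9) threshold is an assumption inferred from the error above, and gpu_supports_fp8e4nv is a hypothetical helper, not FBGEMM's actual guard.

    import unittest

    import torch


    def gpu_supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton lowers fp8e4nv only on compute capability
        # >= (8, 9), i.e. Ada/Hopper-class GPUs; the A10G here is (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage: gate the whole test class on FP8 support.
    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class ActivationTests(unittest.TestCase):
        pass

With a gate of this shape the job would report the FP8 tests as skipped instead of re-deriving the same CompilationError for every Hypothesis example below.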
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)

(same test body as above; with compiled=True, fn() returns and the test proceeds)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(same autotuner/jit/compiler frames as in the first traceback)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
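Under compiled=True the test gets past fn() and instead fails in ref_fn, which row-quantizes the fp32 SiLU-mul reference output via triton_quantize_fp8_row. As an illustration of what that row-wise quantization computes (one scale per row, such that y ≈ y_fp8.float() * y_scale[:, None], which is exactly how the test dequantizes), here is a rough eager sketch; the epsilon and clamping details are assumptions, not the exact semantics of FBGEMM's kernel.

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3 ("fp8e4nv")


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, derived from that row's max magnitude
        # (optionally capped by scale_ub, as in the test).
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # assumed eps; avoids 0-division
        y_scale = row_max / FP8_MAX
        y_fp8 = (y.float() / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return y_fp8.to(torch.float8_e4m3fn), y_scale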
Trying examples: the remaining attempts all reproduced the identical failure,
CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"),
differing only in the drawn parameters and in which kernel Triton tried to compile first:

    T     D     scale_ub  contiguous  compiled  first kernel to fail
    ----  ----  --------  ----------  --------  -------------------------------------
    4096  5120  None      False       False     fn() -> _fbgemm_silu_mul_quant
    4096  7168  None      False       False     fn() -> _fbgemm_silu_mul_quant
    128   7168  None      False       True      ref_fn() -> _kernel_quantize_fp8_row
    128   7168  None      False       False     fn() -> _fbgemm_silu_mul_quant
    4096  5120  1200.0    True        False     fn() -> _fbgemm_silu_mul_quant
    1     5120  None      True        True      ref_fn() -> _kernel_quantize_fp8_row
    2048  5120  None      True        True      ref_fn() -> _kernel_quantize_fp8_row
    128   5120  None      True        True      ref_fn() -> _kernel_quantize_fp8_row

(Each attempt printed the full test body and traceback again. With compiled=False the
eager Triton launch inside silu_mul_quant fails at once; with compiled=True the test
gets past fn() and the same error surfaces in the eager reference quantizer instead.)
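Hypothesis keeps drawing from the same 5x2x2x2x2 grid (80 combinations), so the repeated attempts add no new information. For local debugging it can help to pin one shrunk failing case so a rerun hits it deterministically; below is a minimal sketch using the standard Hypothesis @example decorator, with a simplified, hypothetical test signature rather than the real one.

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st


    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=1, D=5120, scale_ub=None)  # smallest failing shape in this log
    @settings(max_examples=10, deadline=None)
    def test_fp8_shapes(T: int, D: int, scale_ub: Optional[float]) -> None:
        # Placeholder body; the real test would call silu_mul_quant here.
        assert T >= 1 and D in (5120, 7168)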
Hypothesis keeps drawing examples at this verbosity; each one hits the same Triton compile error, so only the parameters and the failing call path differ below.

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in ref_fn (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in ref_fn (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
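For context, the rowwise quantization that _kernel_quantize_fp8_row performs can be written in eager PyTorch. This is a rough stand-in, not FBGEMM's kernel; the epsilon and clamping details are assumptions, but it shows the max-abs/scale/cast structure that the Triton kernel tries to compile for fp8e4nv:

from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Rowwise max-abs scaling into float8_e4m3fn (Triton's "fp8e4nv").
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        # Optional upper bound on the rowwise max, as in the scale_ub cases.
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)

# Dequantize the same way the test does: y_fp8.to(torch.float32) * scale[:, None].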
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1446157Z 2025-05-07T20:33:37.1446588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1446593Z 2025-05-07T20:33:37.1446692Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1446909Z self=, 2025-05-07T20:33:37.1446984Z T=1, 2025-05-07T20:33:37.1447055Z D=5120, 2025-05-07T20:33:37.1447128Z scale_ub=1200.0, 2025-05-07T20:33:37.1447208Z contiguous=True, 2025-05-07T20:33:37.1447282Z compiled=True, 2025-05-07T20:33:37.1447352Z ) 2025-05-07T20:33:37.1447566Z self = 2025-05-07T20:33:37.1447722Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1447727Z 2025-05-07T20:33:37.1447800Z @given( 2025-05-07T20:33:37.1447910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1448004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1448115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1448289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1448401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1448476Z ) 2025-05-07T20:33:37.1448721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1448809Z def test_silu_mul_quant( 2025-05-07T20:33:37.1448920Z self, 2025-05-07T20:33:37.1448991Z T: int, 2025-05-07T20:33:37.1449062Z D: int, 2025-05-07T20:33:37.1449152Z scale_ub: Optional[float], 2025-05-07T20:33:37.1449234Z contiguous: bool, 2025-05-07T20:33:37.1449313Z compiled: bool, 2025-05-07T20:33:37.1449384Z ) -> None: 2025-05-07T20:33:37.1449468Z torch.manual_seed(2025) 2025-05-07T20:33:37.1449539Z 2025-05-07T20:33:37.1449698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1449769Z 2025-05-07T20:33:37.1449855Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1449976Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1450103Z x = x_sign * x_clamp 2025-05-07T20:33:37.1450179Z x0 = x[:, :D] 2025-05-07T20:33:37.1450252Z x1 = x[:, D:] 2025-05-07T20:33:37.1450321Z 2025-05-07T20:33:37.1450398Z if contiguous: 2025-05-07T20:33:37.1450480Z x0 = x0.contiguous() 2025-05-07T20:33:37.1450566Z x1 = x1.contiguous() 2025-05-07T20:33:37.1450632Z 2025-05-07T20:33:37.1450713Z if scale_ub is not None: 2025-05-07T20:33:37.1450812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1450943Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1451014Z ) 2025-05-07T20:33:37.1451090Z else: 2025-05-07T20:33:37.1451178Z scale_ub_tensor = None 2025-05-07T20:33:37.1451316Z 2025-05-07T20:33:37.1451441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1451524Z op = silu_mul_quant 2025-05-07T20:33:37.1451615Z if compiled: 2025-05-07T20:33:37.1451711Z op = torch.compile(op) 2025-05-07T20:33:37.1451809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1451881Z 2025-05-07T20:33:37.1451965Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1451969Z 2025-05-07T20:33:37.1452061Z moe/activation_test.py:117: 2025-05-07T20:33:37.1452188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1452281Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1452374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1452752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1452840Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1453325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1453423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1453777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1453999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1454332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1454424Z kernel = self.compile( 2025-05-07T20:33:37.1454801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1454968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1455088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1455095Z 2025-05-07T20:33:37.1455291Z self = 2025-05-07T20:33:37.1456102Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1456624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89cb8f9d00>} 2025-05-07T20:33:37.1457433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1457622Z context = 2025-05-07T20:33:37.1457626Z 2025-05-07T20:33:37.1457783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1458051Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1468049Z module_map=module_map) 2025-05-07T20:33:37.1468302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1468402Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1468473Z E ^ 2025-05-07T20:33:37.1468827Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1468836Z 2025-05-07T20:33:37.1469259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1469264Z 2025-05-07T20:33:37.1469362Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1469580Z self=, 2025-05-07T20:33:37.1469654Z T=1, 2025-05-07T20:33:37.1469772Z D=5120, 2025-05-07T20:33:37.1469853Z scale_ub=None, 2025-05-07T20:33:37.1469934Z contiguous=False, 2025-05-07T20:33:37.1470012Z compiled=True, 2025-05-07T20:33:37.1470081Z ) 2025-05-07T20:33:37.1470295Z self = 2025-05-07T20:33:37.1470457Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1470462Z 2025-05-07T20:33:37.1470534Z @given( 2025-05-07T20:33:37.1470647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1470745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1470853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1470961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1471069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1471137Z ) 2025-05-07T20:33:37.1471375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1471465Z def test_silu_mul_quant( 2025-05-07T20:33:37.1471536Z self, 2025-05-07T20:33:37.1471608Z T: int, 2025-05-07T20:33:37.1471678Z D: int, 2025-05-07T20:33:37.1471769Z scale_ub: Optional[float], 2025-05-07T20:33:37.1471855Z contiguous: bool, 2025-05-07T20:33:37.1471933Z compiled: bool, 2025-05-07T20:33:37.1472004Z ) -> None: 2025-05-07T20:33:37.1472094Z torch.manual_seed(2025) 2025-05-07T20:33:37.1472161Z 2025-05-07T20:33:37.1472323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1472399Z 2025-05-07T20:33:37.1472485Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1472607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1472691Z x = x_sign * x_clamp 2025-05-07T20:33:37.1472765Z x0 = x[:, :D] 2025-05-07T20:33:37.1472842Z x1 = x[:, D:] 2025-05-07T20:33:37.1472908Z 2025-05-07T20:33:37.1472983Z if contiguous: 2025-05-07T20:33:37.1473072Z x0 = x0.contiguous() 2025-05-07T20:33:37.1473153Z x1 = x1.contiguous() 2025-05-07T20:33:37.1473221Z 2025-05-07T20:33:37.1473355Z if scale_ub is not None: 2025-05-07T20:33:37.1473458Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1473586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1473659Z ) 2025-05-07T20:33:37.1473732Z else: 2025-05-07T20:33:37.1473818Z scale_ub_tensor = None 2025-05-07T20:33:37.1473884Z 2025-05-07T20:33:37.1474049Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1474136Z op = silu_mul_quant 2025-05-07T20:33:37.1474214Z if compiled: 2025-05-07T20:33:37.1474306Z op = torch.compile(op) 2025-05-07T20:33:37.1474404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1474471Z 2025-05-07T20:33:37.1474555Z y_fp8, y_scale = fn() 2025-05-07T20:33:37.1474676Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:37.1474741Z 2025-05-07T20:33:37.1474868Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1474971Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:37.1475107Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:37.1475223Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:37.1475354Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:37.1475423Z 2025-05-07T20:33:37.1475525Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:37.1475529Z 2025-05-07T20:33:37.1475620Z moe/activation_test.py:126: 2025-05-07T20:33:37.1475740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1475841Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:37.1475968Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:37.1476521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:37.1476659Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:37.1477018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1477234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1477594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:37.1477845Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:37.1478234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:37.1478397Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:37.1478733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:37.1478803Z fn() 2025-05-07T20:33:37.1479203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:37.1479284Z self.fn.run( 2025-05-07T20:33:37.1479614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1479700Z kernel = self.compile( 2025-05-07T20:33:37.1480077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1480245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1480369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1480373Z 2025-05-07T20:33:37.1480571Z self = 2025-05-07T20:33:37.1481378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1481876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca706de0>} 2025-05-07T20:33:37.1482605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1482831Z context = 2025-05-07T20:33:37.1482835Z 2025-05-07T20:33:37.1482990Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1483244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1483348Z module_map=module_map) 2025-05-07T20:33:37.1483506Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1483605Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:37.1483676Z E ^ 2025-05-07T20:33:37.1484061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1484066Z 2025-05-07T20:33:37.1484477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1484487Z 2025-05-07T20:33:37.1484582Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1484797Z self=, 2025-05-07T20:33:37.1484869Z T=1, 2025-05-07T20:33:37.1484941Z D=5120, 2025-05-07T20:33:37.1485020Z scale_ub=None, 2025-05-07T20:33:37.1485101Z contiguous=True, 2025-05-07T20:33:37.1485176Z compiled=False, 2025-05-07T20:33:37.1485284Z ) 2025-05-07T20:33:37.1485493Z self = 2025-05-07T20:33:37.1485651Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.1485660Z 2025-05-07T20:33:37.1485735Z @given( 2025-05-07T20:33:37.1485846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1485940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1486047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1486157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1486266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1486334Z ) 2025-05-07T20:33:37.1486569Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1486661Z def test_silu_mul_quant( 2025-05-07T20:33:37.1486732Z self, 2025-05-07T20:33:37.1486804Z T: int, 2025-05-07T20:33:37.1486883Z D: int, 2025-05-07T20:33:37.1486971Z scale_ub: Optional[float], 2025-05-07T20:33:37.1487055Z contiguous: bool, 2025-05-07T20:33:37.1487135Z compiled: bool, 2025-05-07T20:33:37.1487209Z ) -> None: 2025-05-07T20:33:37.1487302Z torch.manual_seed(2025) 2025-05-07T20:33:37.1487366Z 2025-05-07T20:33:37.1487526Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1487596Z 2025-05-07T20:33:37.1487679Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1487796Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1487884Z x = x_sign * x_clamp 2025-05-07T20:33:37.1487957Z x0 = x[:, :D] 2025-05-07T20:33:37.1488029Z x1 = x[:, D:] 2025-05-07T20:33:37.1488096Z 2025-05-07T20:33:37.1488173Z if contiguous: 2025-05-07T20:33:37.1488263Z x0 = x0.contiguous() 2025-05-07T20:33:37.1488346Z x1 = x1.contiguous() 2025-05-07T20:33:37.1488411Z 2025-05-07T20:33:37.1488499Z if scale_ub is not None: 2025-05-07T20:33:37.1488596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1488773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1488851Z ) 2025-05-07T20:33:37.1488925Z else: 2025-05-07T20:33:37.1489012Z scale_ub_tensor = None 2025-05-07T20:33:37.1489081Z 2025-05-07T20:33:37.1489202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1489287Z op = silu_mul_quant 2025-05-07T20:33:37.1489406Z if compiled: 2025-05-07T20:33:37.1489501Z op = torch.compile(op) 2025-05-07T20:33:37.1489601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1489668Z 2025-05-07T20:33:37.1489751Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1489755Z 2025-05-07T20:33:37.1489850Z moe/activation_test.py:117: 2025-05-07T20:33:37.1489970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1490066Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1490169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1490800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1490892Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1491248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1491462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1491795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1491882Z kernel = self.compile( 2025-05-07T20:33:37.1492256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1492425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1492584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1492589Z 2025-05-07T20:33:37.1492789Z self = 2025-05-07T20:33:37.1493551Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1494042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89cb4be700>} 2025-05-07T20:33:37.1494772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1494959Z context = 2025-05-07T20:33:37.1494964Z 2025-05-07T20:33:37.1495123Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1495389Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1495489Z module_map=module_map) 2025-05-07T20:33:37.1495646Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1495737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1495817Z E ^ 2025-05-07T20:33:37.1496163Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1496168Z 2025-05-07T20:33:37.1496569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1496574Z 2025-05-07T20:33:37.1496672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1496887Z self=, 2025-05-07T20:33:37.1496963Z T=128, 2025-05-07T20:33:37.1497077Z D=5120, 2025-05-07T20:33:37.1497152Z scale_ub=None, 2025-05-07T20:33:37.1497238Z contiguous=False, 2025-05-07T20:33:37.1497311Z compiled=True, 2025-05-07T20:33:37.1497377Z ) 2025-05-07T20:33:37.1497589Z self = 2025-05-07T20:33:37.1497751Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1497794Z 2025-05-07T20:33:37.1497866Z @given( 2025-05-07T20:33:37.1497983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1498076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1498183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1498296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1498403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1498481Z ) 2025-05-07T20:33:37.1498717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1498808Z def test_silu_mul_quant( 2025-05-07T20:33:37.1498882Z self, 2025-05-07T20:33:37.1498992Z T: int, 2025-05-07T20:33:37.1499062Z D: int, 2025-05-07T20:33:37.1499159Z scale_ub: Optional[float], 2025-05-07T20:33:37.1499242Z contiguous: bool, 2025-05-07T20:33:37.1499321Z compiled: bool, 2025-05-07T20:33:37.1499400Z ) -> None: 2025-05-07T20:33:37.1499487Z torch.manual_seed(2025) 2025-05-07T20:33:37.1499555Z 2025-05-07T20:33:37.1499721Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1499794Z 2025-05-07T20:33:37.1499882Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1499997Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1500078Z x = x_sign * x_clamp 2025-05-07T20:33:37.1500201Z x0 = x[:, :D] 2025-05-07T20:33:37.1500275Z x1 = x[:, D:] 2025-05-07T20:33:37.1500343Z 2025-05-07T20:33:37.1500422Z if contiguous: 2025-05-07T20:33:37.1500509Z x0 = x0.contiguous() 2025-05-07T20:33:37.1500592Z x1 = x1.contiguous() 2025-05-07T20:33:37.1500661Z 2025-05-07T20:33:37.1500746Z if scale_ub is not None: 2025-05-07T20:33:37.1500841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1500972Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1501045Z ) 2025-05-07T20:33:37.1501118Z else: 2025-05-07T20:33:37.1501207Z scale_ub_tensor = None 2025-05-07T20:33:37.1501273Z 2025-05-07T20:33:37.1501398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1501481Z op = silu_mul_quant 2025-05-07T20:33:37.1501560Z if compiled: 2025-05-07T20:33:37.1501656Z op = torch.compile(op) 2025-05-07T20:33:37.1501758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1501824Z 2025-05-07T20:33:37.1501913Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1501917Z 2025-05-07T20:33:37.1502009Z moe/activation_test.py:117: 2025-05-07T20:33:37.1502136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1502227Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1502317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1502680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1502767Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1503249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1503343Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1503693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1503915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1504293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1504383Z kernel = self.compile( 2025-05-07T20:33:37.1504758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1504924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1505114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1505119Z 2025-05-07T20:33:37.1505318Z self = 2025-05-07T20:33:37.1506076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1506575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca707880>} 2025-05-07T20:33:37.1507344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1507589Z context = 2025-05-07T20:33:37.1507597Z 2025-05-07T20:33:37.1507751Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1508004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1508107Z module_map=module_map) 2025-05-07T20:33:37.1508261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1508394Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1508469Z E ^ 2025-05-07T20:33:37.1508816Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1508823Z 2025-05-07T20:33:37.1509233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1509238Z 2025-05-07T20:33:37.1509332Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1509549Z self=, 2025-05-07T20:33:37.1509624Z T=128, 2025-05-07T20:33:37.1509693Z D=7168, 2025-05-07T20:33:37.1509772Z scale_ub=1200.0, 2025-05-07T20:33:37.1509852Z contiguous=False, 2025-05-07T20:33:37.1509927Z compiled=False, 2025-05-07T20:33:37.1509998Z ) 2025-05-07T20:33:37.1510207Z self = 2025-05-07T20:33:37.1510376Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1510381Z 2025-05-07T20:33:37.1510457Z @given( 2025-05-07T20:33:37.1510570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1510665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1510776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1510885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1510994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1511083Z ) 2025-05-07T20:33:37.1511346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1511434Z def test_silu_mul_quant( 2025-05-07T20:33:37.1511505Z self, 2025-05-07T20:33:37.1511575Z T: int, 2025-05-07T20:33:37.1511648Z D: int, 2025-05-07T20:33:37.1511737Z scale_ub: Optional[float], 2025-05-07T20:33:37.1511817Z contiguous: bool, 2025-05-07T20:33:37.1511905Z compiled: bool, 2025-05-07T20:33:37.1511975Z ) -> None: 2025-05-07T20:33:37.1512061Z torch.manual_seed(2025) 2025-05-07T20:33:37.1512176Z 2025-05-07T20:33:37.1512339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1512407Z 2025-05-07T20:33:37.1512492Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1512608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1512693Z x = x_sign * x_clamp 2025-05-07T20:33:37.1512805Z x0 = x[:, :D] 2025-05-07T20:33:37.1512875Z x1 = x[:, D:] 2025-05-07T20:33:37.1512949Z 2025-05-07T20:33:37.1513028Z if contiguous: 2025-05-07T20:33:37.1513112Z x0 = x0.contiguous() 2025-05-07T20:33:37.1513197Z x1 = x1.contiguous() 2025-05-07T20:33:37.1513265Z 2025-05-07T20:33:37.1513348Z if scale_ub is not None: 2025-05-07T20:33:37.1513451Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1513583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1513655Z ) 2025-05-07T20:33:37.1513730Z else: 2025-05-07T20:33:37.1513822Z scale_ub_tensor = None 2025-05-07T20:33:37.1513894Z 2025-05-07T20:33:37.1514056Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1514142Z op = silu_mul_quant 2025-05-07T20:33:37.1514227Z if compiled: 2025-05-07T20:33:37.1514320Z op = torch.compile(op) 2025-05-07T20:33:37.1514424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1514495Z 2025-05-07T20:33:37.1514581Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1514585Z 2025-05-07T20:33:37.1514675Z moe/activation_test.py:117: 2025-05-07T20:33:37.1514797Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1514891Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1514987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1515512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1515605Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1515963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1516177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1516511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1516603Z kernel = self.compile( 2025-05-07T20:33:37.1516979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1517147Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1517266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1517273Z 2025-05-07T20:33:37.1517467Z self = 2025-05-07T20:33:37.1518232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1518721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca98c7c0>} 2025-05-07T20:33:37.1519455Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1519638Z context = 2025-05-07T20:33:37.1519643Z 2025-05-07T20:33:37.1519801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1520093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1520196Z module_map=module_map) 2025-05-07T20:33:37.1520357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1520464Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1520547Z E ^ 2025-05-07T20:33:37.1520919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1520984Z 2025-05-07T20:33:37.1521412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1521417Z 2025-05-07T20:33:37.1521517Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1521732Z self=, 2025-05-07T20:33:37.1521808Z T=128, 2025-05-07T20:33:37.1521882Z D=5120, 2025-05-07T20:33:37.1521958Z scale_ub=None, 2025-05-07T20:33:37.1522039Z contiguous=False, 2025-05-07T20:33:37.1522123Z compiled=False, 2025-05-07T20:33:37.1522191Z ) 2025-05-07T20:33:37.1522440Z self = 2025-05-07T20:33:37.1522607Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1522612Z 2025-05-07T20:33:37.1522683Z @given( 2025-05-07T20:33:37.1522796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1522891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1522997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1523113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1523219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1523289Z ) 2025-05-07T20:33:37.1523525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1523655Z def test_silu_mul_quant( 2025-05-07T20:33:37.1523727Z self, 2025-05-07T20:33:37.1523802Z T: int, 2025-05-07T20:33:37.1523871Z D: int, 2025-05-07T20:33:37.1523968Z scale_ub: Optional[float], 2025-05-07T20:33:37.1524050Z contiguous: bool, 2025-05-07T20:33:37.1524127Z compiled: bool, 2025-05-07T20:33:37.1524202Z ) -> None: 2025-05-07T20:33:37.1524290Z torch.manual_seed(2025) 2025-05-07T20:33:37.1524357Z 2025-05-07T20:33:37.1524525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1524597Z 2025-05-07T20:33:37.1524681Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1524801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1524882Z x = x_sign * x_clamp 2025-05-07T20:33:37.1524955Z x0 = x[:, :D] 2025-05-07T20:33:37.1525033Z x1 = x[:, D:] 2025-05-07T20:33:37.1525101Z 2025-05-07T20:33:37.1525190Z if contiguous: 2025-05-07T20:33:37.1525274Z x0 = x0.contiguous() 2025-05-07T20:33:37.1525356Z x1 = x1.contiguous() 2025-05-07T20:33:37.1525425Z 2025-05-07T20:33:37.1525507Z if scale_ub is not None: 2025-05-07T20:33:37.1525607Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1525735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1525806Z ) 2025-05-07T20:33:37.1525875Z else: 2025-05-07T20:33:37.1525971Z scale_ub_tensor = None 2025-05-07T20:33:37.1526042Z 2025-05-07T20:33:37.1526167Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1526259Z op = silu_mul_quant 2025-05-07T20:33:37.1526339Z if compiled: 2025-05-07T20:33:37.1526435Z op = torch.compile(op) 2025-05-07T20:33:37.1526539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1526606Z 2025-05-07T20:33:37.1526696Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1526703Z 2025-05-07T20:33:37.1526794Z moe/activation_test.py:117: 2025-05-07T20:33:37.1526963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1527061Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1527157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1527646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1527742Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1528135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1528355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1528690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1528777Z kernel = self.compile( 2025-05-07T20:33:37.1529183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1529352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1529516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1529521Z 2025-05-07T20:33:37.1529717Z self = 2025-05-07T20:33:37.1530473Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1530974Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89cb8fa7a0>} 2025-05-07T20:33:37.1531710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1531941Z context = 2025-05-07T20:33:37.1531946Z 2025-05-07T20:33:37.1532103Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1532358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1532469Z module_map=module_map) 2025-05-07T20:33:37.1532627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1532730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1532804Z E ^ 2025-05-07T20:33:37.1533151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1533156Z 2025-05-07T20:33:37.1533594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1533599Z 2025-05-07T20:33:37.1533701Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1533926Z self=, 2025-05-07T20:33:37.1534000Z T=128, 2025-05-07T20:33:37.1534075Z D=5120, 2025-05-07T20:33:37.1534162Z scale_ub=1200.0, 2025-05-07T20:33:37.1534242Z contiguous=True, 2025-05-07T20:33:37.1534321Z compiled=False, 2025-05-07T20:33:37.1534394Z ) 2025-05-07T20:33:37.1534604Z self = 2025-05-07T20:33:37.1534766Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.1534771Z 2025-05-07T20:33:37.1534846Z @given( 2025-05-07T20:33:37.1534958Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1535053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1535168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1535283Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1535439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1535513Z ) 2025-05-07T20:33:37.1535754Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1535850Z def test_silu_mul_quant( 2025-05-07T20:33:37.1535921Z self, 2025-05-07T20:33:37.1535995Z T: int, 2025-05-07T20:33:37.1536076Z D: int, 2025-05-07T20:33:37.1536208Z scale_ub: Optional[float], 2025-05-07T20:33:37.1536293Z contiguous: bool, 2025-05-07T20:33:37.1536381Z compiled: bool, 2025-05-07T20:33:37.1536455Z ) -> None: 2025-05-07T20:33:37.1536544Z torch.manual_seed(2025) 2025-05-07T20:33:37.1536619Z 2025-05-07T20:33:37.1536779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1536853Z 2025-05-07T20:33:37.1536944Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1537063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1537154Z x = x_sign * x_clamp 2025-05-07T20:33:37.1537228Z x0 = x[:, :D] 2025-05-07T20:33:37.1537338Z x1 = x[:, D:] 2025-05-07T20:33:37.1537412Z 2025-05-07T20:33:37.1537492Z if contiguous: 2025-05-07T20:33:37.1537575Z x0 = x0.contiguous() 2025-05-07T20:33:37.1537664Z x1 = x1.contiguous() 2025-05-07T20:33:37.1537733Z 2025-05-07T20:33:37.1537820Z if scale_ub is not None: 2025-05-07T20:33:37.1537925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1538052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1538129Z ) 2025-05-07T20:33:37.1538202Z else: 2025-05-07T20:33:37.1538288Z scale_ub_tensor = None 2025-05-07T20:33:37.1538360Z 2025-05-07T20:33:37.1538483Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1538610Z op = silu_mul_quant 2025-05-07T20:33:37.1538697Z if compiled: 2025-05-07T20:33:37.1538796Z op = torch.compile(op) 2025-05-07T20:33:37.1538898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1538977Z 2025-05-07T20:33:37.1539064Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1539069Z 2025-05-07T20:33:37.1539162Z moe/activation_test.py:117: 2025-05-07T20:33:37.1539291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1539391Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1539493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1539985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1540278Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca480c20>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = ...
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca481ee0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
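[editor's note] The failure looks environmental rather than a logic bug: Triton can only lower fp8e4nv (PyTorch's torch.float8_e4m3fn) on NVIDIA parts with compute capability 8.9 or newer, and the GPU on this runner appears to be an A10G (compute capability 8.6), where Triton offers only 'fp8e4b15' and 'fp8e5'. A minimal sketch of a capability guard such a test could use follows; the helper, the class name, and the 8.9 threshold are assumptions inferred from this log, not FBGEMM code.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton lowers fp8e4nv only on SM 8.9+ parts
    # (e.g. L4/L40S/H100). The A10G here is SM 8.6, which is why the
    # compiler offers only 'fp8e4b15' and 'fp8e5'.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class ActivationTests(unittest.TestCase):  # class name assumed from the log paths
    ...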
Trying example: test_silu_mul_quant(self=..., T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the example above: CompilationError from _fbgemm_silu_mul_quant, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
Trying example: test_silu_mul_quant(
    self=...,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = ...
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test body as above; this example gets past the first kernel launch and fails in the reference path instead]
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca6f4180>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
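[editor's note] The reference path dies the same way: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, which also materializes fp8e4nv. For readers following along, here is a hedged pure-PyTorch sketch of what ref_fn computes, assuming 448.0 (the finite max of torch.float8_e4m3fn) as the fp8 bound; the exact scale and eps handling inside fbgemm's Triton kernel may differ.

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # Mirrors ref_fn above: SiLU(x0) * x1, computed in fp32.
    x0_fp32 = x0.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Row-wise quantization sketch: one scale per row, chosen so that
    # y ~= y_fp8.to(torch.float32) * scale[:, None], matching how the test
    # dequantizes. fbgemm's _kernel_quantize_fp8_row may clamp differently.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale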
Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=..., T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=..., T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=..., T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[each of these examples fails in _fbgemm_silu_mul_quant with the same CompilationError as above; test bodies and tracebacks elided]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1690451Z 2025-05-07T20:33:37.1690860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1690903Z 2025-05-07T20:33:37.1690997Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1691213Z self=, 2025-05-07T20:33:37.1691289Z T=4096, 2025-05-07T20:33:37.1691361Z D=5120, 2025-05-07T20:33:37.1691440Z scale_ub=1200.0, 2025-05-07T20:33:37.1691520Z contiguous=False, 2025-05-07T20:33:37.1691596Z compiled=False, 2025-05-07T20:33:37.1691663Z ) 2025-05-07T20:33:37.1691873Z self = 2025-05-07T20:33:37.1692045Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1692049Z 2025-05-07T20:33:37.1692123Z @given( 2025-05-07T20:33:37.1692234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1692324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1692438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1692552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1692661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1692729Z ) 2025-05-07T20:33:37.1692969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1693060Z def test_silu_mul_quant( 2025-05-07T20:33:37.1693129Z self, 2025-05-07T20:33:37.1693200Z T: int, 2025-05-07T20:33:37.1693272Z D: int, 2025-05-07T20:33:37.1693360Z scale_ub: Optional[float], 2025-05-07T20:33:37.1693446Z contiguous: bool, 2025-05-07T20:33:37.1693525Z compiled: bool, 2025-05-07T20:33:37.1693595Z ) -> None: 2025-05-07T20:33:37.1693684Z torch.manual_seed(2025) 2025-05-07T20:33:37.1693750Z 2025-05-07T20:33:37.1693910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1693976Z 2025-05-07T20:33:37.1694064Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1694183Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1694268Z x = x_sign * x_clamp 2025-05-07T20:33:37.1694342Z x0 = x[:, :D] 2025-05-07T20:33:37.1694457Z x1 = x[:, D:] 2025-05-07T20:33:37.1694530Z 2025-05-07T20:33:37.1694609Z if contiguous: 2025-05-07T20:33:37.1694694Z x0 = x0.contiguous() 2025-05-07T20:33:37.1694781Z x1 = x1.contiguous() 2025-05-07T20:33:37.1694849Z 2025-05-07T20:33:37.1694930Z if scale_ub is not None: 2025-05-07T20:33:37.1695031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1695198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1695266Z ) 2025-05-07T20:33:37.1695343Z else: 2025-05-07T20:33:37.1695431Z scale_ub_tensor = None 2025-05-07T20:33:37.1695499Z 2025-05-07T20:33:37.1695623Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1695708Z op = silu_mul_quant 2025-05-07T20:33:37.1695797Z if compiled: 2025-05-07T20:33:37.1695887Z op = torch.compile(op) 2025-05-07T20:33:37.1695988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1696055Z 2025-05-07T20:33:37.1696180Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1696184Z 2025-05-07T20:33:37.1696275Z moe/activation_test.py:117: 2025-05-07T20:33:37.1696403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1696497Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1696597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1697085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:37.1697175Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1697533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1697815Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1698153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1698243Z kernel = self.compile( 2025-05-07T20:33:37.1698625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1698796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1698917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1698924Z 2025-05-07T20:33:37.1699119Z self = 2025-05-07T20:33:37.1699881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1700376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca2556c0>} 2025-05-07T20:33:37.1701111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1701296Z context = 2025-05-07T20:33:37.1701303Z 2025-05-07T20:33:37.1701464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1701719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1705119Z module_map=module_map) 2025-05-07T20:33:37.1705312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1705414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1705494Z E ^ 2025-05-07T20:33:37.1705985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1705990Z 2025-05-07T20:33:37.1706409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1706413Z 2025-05-07T20:33:37.1706514Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1706733Z self=, 2025-05-07T20:33:37.1706846Z T=4096, 2025-05-07T20:33:37.1706920Z D=5120, 2025-05-07T20:33:37.1706999Z scale_ub=1200.0, 2025-05-07T20:33:37.1707078Z contiguous=False, 2025-05-07T20:33:37.1707160Z compiled=True, 2025-05-07T20:33:37.1707227Z ) 2025-05-07T20:33:37.1707501Z self = 2025-05-07T20:33:37.1707675Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:37.1707682Z 2025-05-07T20:33:37.1707756Z @given( 2025-05-07T20:33:37.1707875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1707969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1708126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1708243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1708349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1708415Z ) 2025-05-07T20:33:37.1708661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1708749Z def test_silu_mul_quant( 2025-05-07T20:33:37.1708818Z self, 2025-05-07T20:33:37.1708891Z T: int, 2025-05-07T20:33:37.1708963Z D: int, 2025-05-07T20:33:37.1709057Z scale_ub: Optional[float], 2025-05-07T20:33:37.1709140Z contiguous: bool, 2025-05-07T20:33:37.1709218Z compiled: bool, 2025-05-07T20:33:37.1709338Z ) -> None: 2025-05-07T20:33:37.1709429Z torch.manual_seed(2025) 2025-05-07T20:33:37.1709499Z 2025-05-07T20:33:37.1709669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1709737Z 2025-05-07T20:33:37.1709828Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1709953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1710039Z x = x_sign * x_clamp 2025-05-07T20:33:37.1710116Z x0 = x[:, :D] 2025-05-07T20:33:37.1710197Z x1 = x[:, D:] 2025-05-07T20:33:37.1710268Z 2025-05-07T20:33:37.1710347Z if contiguous: 2025-05-07T20:33:37.1710435Z x0 = x0.contiguous() 2025-05-07T20:33:37.1710517Z x1 = x1.contiguous() 2025-05-07T20:33:37.1710591Z 2025-05-07T20:33:37.1710675Z if scale_ub is not None: 2025-05-07T20:33:37.1710775Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1710908Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1710985Z ) 2025-05-07T20:33:37.1711056Z else: 2025-05-07T20:33:37.1711146Z scale_ub_tensor = None 2025-05-07T20:33:37.1711215Z 2025-05-07T20:33:37.1711339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1711428Z op = silu_mul_quant 2025-05-07T20:33:37.1711510Z if compiled: 2025-05-07T20:33:37.1711605Z op = torch.compile(op) 2025-05-07T20:33:37.1711708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1711779Z 2025-05-07T20:33:37.1711869Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1711874Z 2025-05-07T20:33:37.1711966Z moe/activation_test.py:117: 2025-05-07T20:33:37.1712091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1712190Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1712287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1712653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1712747Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1713280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1713380Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1713734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1713950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1714899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1714992Z kernel = self.compile( 2025-05-07T20:33:37.1715371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1715545Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1715671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1715676Z 2025-05-07T20:33:37.1715879Z self = 2025-05-07T20:33:37.1716682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1717183Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca256fc0>} 2025-05-07T20:33:37.1717914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1718136Z context = 2025-05-07T20:33:37.1718141Z 2025-05-07T20:33:37.1718307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1718566Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1718671Z module_map=module_map) 2025-05-07T20:33:37.1718826Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1718920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1718999Z E ^ 2025-05-07T20:33:37.1719346Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1719351Z 2025-05-07T20:33:37.1719761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1719769Z 2025-05-07T20:33:37.1719870Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1720090Z self=, 2025-05-07T20:33:37.1720176Z T=2048, 2025-05-07T20:33:37.1720272Z D=7168, 2025-05-07T20:33:37.1720356Z scale_ub=1200.0, 2025-05-07T20:33:37.1720464Z contiguous=False, 2025-05-07T20:33:37.1720543Z compiled=False, 2025-05-07T20:33:37.1720612Z ) 2025-05-07T20:33:37.1720827Z self = 2025-05-07T20:33:37.1720997Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1721005Z 2025-05-07T20:33:37.1721081Z @given( 2025-05-07T20:33:37.1721196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1721290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1721400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1721510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1721618Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1721690Z ) 2025-05-07T20:33:37.1721930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1722064Z def test_silu_mul_quant( 2025-05-07T20:33:37.1722143Z self, 2025-05-07T20:33:37.1722221Z T: int, 2025-05-07T20:33:37.1722291Z D: int, 2025-05-07T20:33:37.1722384Z scale_ub: Optional[float], 2025-05-07T20:33:37.1722469Z contiguous: bool, 2025-05-07T20:33:37.1722553Z compiled: bool, 2025-05-07T20:33:37.1722670Z ) -> None: 2025-05-07T20:33:37.1722761Z torch.manual_seed(2025) 2025-05-07T20:33:37.1722830Z 2025-05-07T20:33:37.1722994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1723062Z 2025-05-07T20:33:37.1723153Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1723270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1723354Z x = x_sign * x_clamp 2025-05-07T20:33:37.1723442Z x0 = x[:, :D] 2025-05-07T20:33:37.1723518Z x1 = x[:, D:] 2025-05-07T20:33:37.1723587Z 2025-05-07T20:33:37.1723673Z if contiguous: 2025-05-07T20:33:37.1723761Z x0 = x0.contiguous() 2025-05-07T20:33:37.1723886Z x1 = x1.contiguous() 2025-05-07T20:33:37.1723959Z 2025-05-07T20:33:37.1724045Z if scale_ub is not None: 2025-05-07T20:33:37.1724150Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1724279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1724356Z ) 2025-05-07T20:33:37.1724433Z else: 2025-05-07T20:33:37.1724522Z scale_ub_tensor = None 2025-05-07T20:33:37.1724592Z 2025-05-07T20:33:37.1724720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1724804Z op = silu_mul_quant 2025-05-07T20:33:37.1724886Z if compiled: 2025-05-07T20:33:37.1724984Z op = torch.compile(op) 2025-05-07T20:33:37.1725130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1725200Z 2025-05-07T20:33:37.1725291Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1725298Z 2025-05-07T20:33:37.1725389Z moe/activation_test.py:117: 2025-05-07T20:33:37.1725518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1725613Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1725708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1726204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:37.1726298Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1726653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1726871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1727208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1727304Z kernel = self.compile( 2025-05-07T20:33:37.1727701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1727869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1727992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1727997Z 2025-05-07T20:33:37.1728197Z self = 2025-05-07T20:33:37.1728961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1729453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca257ec0>} 2025-05-07T20:33:37.1730546Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1730729Z context = 2025-05-07T20:33:37.1730734Z 2025-05-07T20:33:37.1730891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1731188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1731292Z module_map=module_map) 2025-05-07T20:33:37.1731448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1731544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1731618Z E ^ 2025-05-07T20:33:37.1731966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1731974Z 2025-05-07T20:33:37.1732404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1732471Z 2025-05-07T20:33:37.1732570Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1732788Z self=, 2025-05-07T20:33:37.1732862Z T=1, 2025-05-07T20:33:37.1732938Z D=7168, 2025-05-07T20:33:37.1733019Z scale_ub=None, 2025-05-07T20:33:37.1733098Z contiguous=True, 2025-05-07T20:33:37.1733184Z compiled=False, 2025-05-07T20:33:37.1733253Z ) 2025-05-07T20:33:37.1733465Z self = 2025-05-07T20:33:37.1733626Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.1733631Z 2025-05-07T20:33:37.1733705Z @given( 2025-05-07T20:33:37.1733862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1733959Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1734071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1734189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1734297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1734368Z ) 2025-05-07T20:33:37.1734606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1734693Z def test_silu_mul_quant( 2025-05-07T20:33:37.1734771Z self, 2025-05-07T20:33:37.1734849Z T: int, 2025-05-07T20:33:37.1734923Z D: int, 2025-05-07T20:33:37.1735015Z scale_ub: Optional[float], 2025-05-07T20:33:37.1735104Z contiguous: bool, 2025-05-07T20:33:37.1735184Z compiled: bool, 2025-05-07T20:33:37.1735257Z ) -> None: 2025-05-07T20:33:37.1735351Z torch.manual_seed(2025) 2025-05-07T20:33:37.1735421Z 2025-05-07T20:33:37.1735591Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1735661Z 2025-05-07T20:33:37.1735748Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1735876Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1735965Z x = x_sign * x_clamp 2025-05-07T20:33:37.1736043Z x0 = x[:, :D] 2025-05-07T20:33:37.1736123Z x1 = x[:, D:] 2025-05-07T20:33:37.1736191Z 2025-05-07T20:33:37.1736270Z if contiguous: 2025-05-07T20:33:37.1736366Z x0 = x0.contiguous() 2025-05-07T20:33:37.1736451Z x1 = x1.contiguous() 2025-05-07T20:33:37.1736521Z 2025-05-07T20:33:37.1736608Z if scale_ub is not None: 2025-05-07T20:33:37.1736709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1736838Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1736912Z ) 2025-05-07T20:33:37.1736984Z else: 2025-05-07T20:33:37.1737079Z scale_ub_tensor = None 2025-05-07T20:33:37.1737150Z 2025-05-07T20:33:37.1737275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1737411Z op = silu_mul_quant 2025-05-07T20:33:37.1737492Z if compiled: 2025-05-07T20:33:37.1737589Z op = torch.compile(op) 2025-05-07T20:33:37.1737693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1737761Z 2025-05-07T20:33:37.1737847Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1737852Z 2025-05-07T20:33:37.1737986Z moe/activation_test.py:117: 2025-05-07T20:33:37.1738110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1738208Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1738302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1738794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1738893Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1739247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1739507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1739849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1739940Z kernel = self.compile( 2025-05-07T20:33:37.1740577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1740757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1740878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1740883Z 2025-05-07T20:33:37.1741081Z self = 2025-05-07T20:33:37.1741846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1742438Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819f10cc0>} 2025-05-07T20:33:37.1743170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1743357Z context = 2025-05-07T20:33:37.1743366Z 2025-05-07T20:33:37.1743521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1743777Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1743889Z module_map=module_map) 2025-05-07T20:33:37.1744045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1744140Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1744217Z E ^ 2025-05-07T20:33:37.1744566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1744570Z 2025-05-07T20:33:37.1744987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1744994Z 2025-05-07T20:33:37.1745089Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1745304Z self=, 2025-05-07T20:33:37.1745381Z T=16384, 2025-05-07T20:33:37.1745455Z D=7168, 2025-05-07T20:33:37.1745533Z scale_ub=1200.0, 2025-05-07T20:33:37.1745616Z contiguous=False, 2025-05-07T20:33:37.1745696Z compiled=True, 2025-05-07T20:33:37.1745764Z ) 2025-05-07T20:33:37.1745977Z self = 2025-05-07T20:33:37.1746229Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:37.1746234Z 2025-05-07T20:33:37.1746313Z @given( 2025-05-07T20:33:37.1746428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1746521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1746634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1746805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1746914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1746988Z ) 2025-05-07T20:33:37.1747223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1747314Z def test_silu_mul_quant( 2025-05-07T20:33:37.1747388Z self, 2025-05-07T20:33:37.1747513Z T: int, 2025-05-07T20:33:37.1747593Z D: int, 2025-05-07T20:33:37.1747686Z scale_ub: Optional[float], 2025-05-07T20:33:37.1747772Z contiguous: bool, 2025-05-07T20:33:37.1747858Z compiled: bool, 2025-05-07T20:33:37.1747934Z ) -> None: 2025-05-07T20:33:37.1748086Z torch.manual_seed(2025) 2025-05-07T20:33:37.1748157Z 2025-05-07T20:33:37.1748320Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1748390Z 2025-05-07T20:33:37.1748478Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1748599Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1748681Z x = x_sign * x_clamp 2025-05-07T20:33:37.1748758Z x0 = x[:, :D] 2025-05-07T20:33:37.1748831Z x1 = x[:, D:] 2025-05-07T20:33:37.1748902Z 2025-05-07T20:33:37.1748978Z if contiguous: 2025-05-07T20:33:37.1749064Z x0 = x0.contiguous() 2025-05-07T20:33:37.1749149Z x1 = x1.contiguous() 2025-05-07T20:33:37.1749260Z 2025-05-07T20:33:37.1749344Z if scale_ub is not None: 2025-05-07T20:33:37.1749448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1749580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1749652Z ) 2025-05-07T20:33:37.1749730Z else: 2025-05-07T20:33:37.1749820Z scale_ub_tensor = None 2025-05-07T20:33:37.1749888Z 2025-05-07T20:33:37.1750015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1750100Z op = silu_mul_quant 2025-05-07T20:33:37.1750187Z if compiled: 2025-05-07T20:33:37.1750300Z op = torch.compile(op) 2025-05-07T20:33:37.1750408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1750498Z 2025-05-07T20:33:37.1750588Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1750592Z 2025-05-07T20:33:37.1750683Z moe/activation_test.py:117: 2025-05-07T20:33:37.1750808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1750907Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1751002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1751373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1751462Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1751951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1752042Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1752396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1752617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1752951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1753038Z kernel = self.compile( 2025-05-07T20:33:37.1753421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1753634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1753761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1753766Z 2025-05-07T20:33:37.1753962Z self = 2025-05-07T20:33:37.1754723Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1755256Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819f120c0>} 2025-05-07T20:33:37.1755990Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1756218Z context = 2025-05-07T20:33:37.1756223Z 2025-05-07T20:33:37.1756381Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1756637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1756742Z module_map=module_map) 2025-05-07T20:33:37.1756898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1756993Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1757067Z E ^ 2025-05-07T20:33:37.1757412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1757417Z 2025-05-07T20:33:37.1757891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1757896Z 2025-05-07T20:33:37.1757996Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1758216Z self=, 2025-05-07T20:33:37.1758290Z T=1, 2025-05-07T20:33:37.1758363Z D=7168, 2025-05-07T20:33:37.1758440Z scale_ub=None, 2025-05-07T20:33:37.1758523Z contiguous=False, 2025-05-07T20:33:37.1758601Z compiled=False, 2025-05-07T20:33:37.1758674Z ) 2025-05-07T20:33:37.1758886Z self = 2025-05-07T20:33:37.1759047Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1759055Z 2025-05-07T20:33:37.1759126Z @given( 2025-05-07T20:33:37.1759240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1759337Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1759447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1759556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1759670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1759738Z ) 2025-05-07T20:33:37.1759979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1760072Z def test_silu_mul_quant( 2025-05-07T20:33:37.1760144Z self, 2025-05-07T20:33:37.1760217Z T: int, 2025-05-07T20:33:37.1760292Z D: int, 2025-05-07T20:33:37.1760386Z scale_ub: Optional[float], 2025-05-07T20:33:37.1760472Z contiguous: bool, 2025-05-07T20:33:37.1760551Z compiled: bool, 2025-05-07T20:33:37.1760626Z ) -> None: 2025-05-07T20:33:37.1760716Z torch.manual_seed(2025) 2025-05-07T20:33:37.1760785Z 2025-05-07T20:33:37.1760947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1761019Z 2025-05-07T20:33:37.1761106Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1761224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1761357Z x = x_sign * x_clamp 2025-05-07T20:33:37.1761431Z x0 = x[:, :D] 2025-05-07T20:33:37.1761506Z x1 = x[:, D:] 2025-05-07T20:33:37.1761574Z 2025-05-07T20:33:37.1761653Z if contiguous: 2025-05-07T20:33:37.1761744Z x0 = x0.contiguous() 2025-05-07T20:33:37.1761829Z x1 = x1.contiguous() 2025-05-07T20:33:37.1761898Z 2025-05-07T20:33:37.1762064Z if scale_ub is not None: 2025-05-07T20:33:37.1762163Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1762292Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1762366Z ) 2025-05-07T20:33:37.1762439Z else: 2025-05-07T20:33:37.1762530Z scale_ub_tensor = None 2025-05-07T20:33:37.1762604Z 2025-05-07T20:33:37.1762727Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1762813Z op = silu_mul_quant 2025-05-07T20:33:37.1762896Z if compiled: 2025-05-07T20:33:37.1762991Z op = torch.compile(op) 2025-05-07T20:33:37.1763093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1763202Z 2025-05-07T20:33:37.1763289Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1763293Z 2025-05-07T20:33:37.1763394Z moe/activation_test.py:117: 2025-05-07T20:33:37.1763518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1763614Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1763715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1764203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1764294Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1764653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1764916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1765258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1765348Z kernel = self.compile( 2025-05-07T20:33:37.1765730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1765899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1766021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1766026Z 2025-05-07T20:33:37.1766231Z self = 2025-05-07T20:33:37.1766991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1767490Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819f12c00>} 2025-05-07T20:33:37.1768234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1768424Z context = 2025-05-07T20:33:37.1768428Z 2025-05-07T20:33:37.1768591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1768846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1768948Z module_map=module_map) 2025-05-07T20:33:37.1769109Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1769205Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1769278Z E ^ 2025-05-07T20:33:37.1769672Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1769677Z 2025-05-07T20:33:37.1770109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1770113Z 2025-05-07T20:33:37.1770214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1770471Z self=, 2025-05-07T20:33:37.1770552Z T=2048, 2025-05-07T20:33:37.1770627Z D=7168, 2025-05-07T20:33:37.1770706Z scale_ub=None, 2025-05-07T20:33:37.1770793Z contiguous=False, 2025-05-07T20:33:37.1770873Z compiled=True, 2025-05-07T20:33:37.1770944Z ) 2025-05-07T20:33:37.1771163Z self = 2025-05-07T20:33:37.1771337Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1771342Z 2025-05-07T20:33:37.1771416Z @given( 2025-05-07T20:33:37.1771537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1771673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1771789Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1771902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1772012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1772088Z ) 2025-05-07T20:33:37.1772325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1772414Z def test_silu_mul_quant( 2025-05-07T20:33:37.1772490Z self, 2025-05-07T20:33:37.1772563Z T: int, 2025-05-07T20:33:37.1772638Z D: int, 2025-05-07T20:33:37.1772733Z scale_ub: Optional[float], 2025-05-07T20:33:37.1772818Z contiguous: bool, 2025-05-07T20:33:37.1772941Z compiled: bool, 2025-05-07T20:33:37.1773022Z ) -> None: 2025-05-07T20:33:37.1773114Z torch.manual_seed(2025) 2025-05-07T20:33:37.1773194Z 2025-05-07T20:33:37.1773360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1773433Z 2025-05-07T20:33:37.1773525Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1773645Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1773733Z x = x_sign * x_clamp 2025-05-07T20:33:37.1773816Z x0 = x[:, :D] 2025-05-07T20:33:37.1773891Z x1 = x[:, D:] 2025-05-07T20:33:37.1773966Z 2025-05-07T20:33:37.1774051Z if contiguous: 2025-05-07T20:33:37.1774138Z x0 = x0.contiguous() 2025-05-07T20:33:37.1774223Z x1 = x1.contiguous() 2025-05-07T20:33:37.1774293Z 2025-05-07T20:33:37.1774378Z if scale_ub is not None: 2025-05-07T20:33:37.1774479Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1774614Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1774687Z ) 2025-05-07T20:33:37.1774767Z else: 2025-05-07T20:33:37.1774859Z scale_ub_tensor = None 2025-05-07T20:33:37.1774928Z 2025-05-07T20:33:37.1775062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1775149Z op = silu_mul_quant 2025-05-07T20:33:37.1775229Z if compiled: 2025-05-07T20:33:37.1775330Z op = torch.compile(op) 2025-05-07T20:33:37.1775434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1775502Z 2025-05-07T20:33:37.1775592Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1775596Z 2025-05-07T20:33:37.1775689Z moe/activation_test.py:117: 2025-05-07T20:33:37.1775816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1775910Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1776004Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1776373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1776509Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1776998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1777101Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1777455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1777720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1778057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1778147Z kernel = self.compile( 2025-05-07T20:33:37.1778532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1778706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1778832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1778841Z 2025-05-07T20:33:37.1779078Z self = 2025-05-07T20:33:37.1779843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1780374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca1842c0>} 2025-05-07T20:33:37.1781128Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1781359Z context = 2025-05-07T20:33:37.1781364Z 2025-05-07T20:33:37.1781526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1781786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1781899Z module_map=module_map) 2025-05-07T20:33:37.1782056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1782153Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1782235Z E ^ 2025-05-07T20:33:37.1782583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1782588Z 2025-05-07T20:33:37.1783003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1783010Z 2025-05-07T20:33:37.1783109Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1783329Z self=, 2025-05-07T20:33:37.1783411Z T=4096, 2025-05-07T20:33:37.1783487Z D=7168, 2025-05-07T20:33:37.1783572Z scale_ub=None, 2025-05-07T20:33:37.1783660Z contiguous=False, 2025-05-07T20:33:37.1783738Z compiled=True, 2025-05-07T20:33:37.1783811Z ) 2025-05-07T20:33:37.1784025Z self = 2025-05-07T20:33:37.1784197Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1784201Z 2025-05-07T20:33:37.1784278Z @given( 2025-05-07T20:33:37.1784391Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1784483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1784594Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1784704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1784818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1784890Z ) 2025-05-07T20:33:37.1785171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1785264Z def test_silu_mul_quant( 2025-05-07T20:33:37.1785336Z self, 2025-05-07T20:33:37.1785410Z T: int, 2025-05-07T20:33:37.1785485Z D: int, 2025-05-07T20:33:37.1785579Z scale_ub: Optional[float], 2025-05-07T20:33:37.1785664Z contiguous: bool, 2025-05-07T20:33:37.1785795Z compiled: bool, 2025-05-07T20:33:37.1785871Z ) -> None: 2025-05-07T20:33:37.1785963Z torch.manual_seed(2025) 2025-05-07T20:33:37.1786040Z 2025-05-07T20:33:37.1786207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1786281Z 2025-05-07T20:33:37.1786370Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1786491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1786583Z x = x_sign * x_clamp 2025-05-07T20:33:37.1786660Z x0 = x[:, :D] 2025-05-07T20:33:37.1786736Z x1 = x[:, D:] 2025-05-07T20:33:37.1786810Z 2025-05-07T20:33:37.1786887Z if contiguous: 2025-05-07T20:33:37.1787011Z x0 = x0.contiguous() 2025-05-07T20:33:37.1787100Z x1 = x1.contiguous() 2025-05-07T20:33:37.1787167Z 2025-05-07T20:33:37.1787250Z if scale_ub is not None: 2025-05-07T20:33:37.1787352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1787537Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1787610Z ) 2025-05-07T20:33:37.1787683Z else: 2025-05-07T20:33:37.1787772Z scale_ub_tensor = None 2025-05-07T20:33:37.1787843Z 2025-05-07T20:33:37.1787966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1788051Z op = silu_mul_quant 2025-05-07T20:33:37.1788132Z if compiled: 2025-05-07T20:33:37.1788272Z op = torch.compile(op) 2025-05-07T20:33:37.1788375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1788442Z 2025-05-07T20:33:37.1788529Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1788534Z 2025-05-07T20:33:37.1788626Z moe/activation_test.py:117: 2025-05-07T20:33:37.1788751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1788845Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1788940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1789304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1789392Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1789880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1789972Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1790328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1790554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1790896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1790991Z kernel = self.compile( 2025-05-07T20:33:37.1791367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1791539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1791663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1791668Z 2025-05-07T20:33:37.1791865Z self = 2025-05-07T20:33:37.1792628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1793192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca184d60>} 2025-05-07T20:33:37.1793928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1794155Z context = 2025-05-07T20:33:37.1794159Z 2025-05-07T20:33:37.1794319Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1794578Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1794684Z module_map=module_map) 2025-05-07T20:33:37.1794845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1794944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1795017Z E ^ 2025-05-07T20:33:37.1795410Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1795418Z 2025-05-07T20:33:37.1795849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1795856Z 2025-05-07T20:33:37.1795954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1796173Z self=, 2025-05-07T20:33:37.1796246Z T=16384, 2025-05-07T20:33:37.1796318Z D=5120, 2025-05-07T20:33:37.1796401Z scale_ub=1200.0, 2025-05-07T20:33:37.1796484Z contiguous=False, 2025-05-07T20:33:37.1796564Z compiled=False, 2025-05-07T20:33:37.1796637Z ) 2025-05-07T20:33:37.1796893Z self = 2025-05-07T20:33:37.1797071Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1797078Z 2025-05-07T20:33:37.1797150Z @given( 2025-05-07T20:33:37.1797266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1797363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1797474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1797585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1797703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1797772Z ) 2025-05-07T20:33:37.1798017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1798103Z def test_silu_mul_quant( 2025-05-07T20:33:37.1798178Z self, 2025-05-07T20:33:37.1798257Z T: int, 2025-05-07T20:33:37.1798330Z D: int, 2025-05-07T20:33:37.1798425Z scale_ub: Optional[float], 2025-05-07T20:33:37.1798510Z contiguous: bool, 2025-05-07T20:33:37.1798589Z compiled: bool, 2025-05-07T20:33:37.1798661Z ) -> None: 2025-05-07T20:33:37.1798755Z torch.manual_seed(2025) 2025-05-07T20:33:37.1798830Z 2025-05-07T20:33:37.1798996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1799072Z 2025-05-07T20:33:37.1799158Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1799282Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1799365Z x = x_sign * x_clamp 2025-05-07T20:33:37.1799440Z x0 = x[:, :D] 2025-05-07T20:33:37.1799516Z x1 = x[:, D:] 2025-05-07T20:33:37.1799582Z 2025-05-07T20:33:37.1799659Z if contiguous: 2025-05-07T20:33:37.1799748Z x0 = x0.contiguous() 2025-05-07T20:33:37.1799833Z x1 = x1.contiguous() 2025-05-07T20:33:37.1799904Z 2025-05-07T20:33:37.1799994Z if scale_ub is not None: 2025-05-07T20:33:37.1800097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1800224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1800350Z ) 2025-05-07T20:33:37.1800424Z else: 2025-05-07T20:33:37.1800516Z scale_ub_tensor = None 2025-05-07T20:33:37.1800589Z 2025-05-07T20:33:37.1800712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1800799Z op = silu_mul_quant 2025-05-07T20:33:37.1800880Z if compiled: 2025-05-07T20:33:37.1801018Z op = torch.compile(op) 2025-05-07T20:33:37.1801121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1801189Z 2025-05-07T20:33:37.1801274Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1801279Z 2025-05-07T20:33:37.1801372Z moe/activation_test.py:117: 2025-05-07T20:33:37.1801497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1801592Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1801691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1802183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:37.1802315Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1802672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1802887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1803229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1803319Z kernel = self.compile( 2025-05-07T20:33:37.1803714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1803884Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1804046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1804050Z 2025-05-07T20:33:37.1804253Z self = 2025-05-07T20:33:37.1805017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1805512Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca185c60>} 2025-05-07T20:33:37.1806248Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1806430Z context = 2025-05-07T20:33:37.1806437Z 2025-05-07T20:33:37.1806601Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1806857Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1806970Z module_map=module_map) 2025-05-07T20:33:37.1807127Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1807222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1807298Z E ^ 2025-05-07T20:33:37.1807645Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:37.1807650Z 
2025-05-07T20:33:37.1808079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:37.1808086Z 
2025-05-07T20:33:37.1808182Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:37.1808400Z     self=,
2025-05-07T20:33:37.1808481Z     T=16384,
2025-05-07T20:33:37.1808554Z     D=5120,
2025-05-07T20:33:37.1808679Z     scale_ub=1200.0,
2025-05-07T20:33:37.1808768Z     contiguous=True,
2025-05-07T20:33:37.1808855Z     compiled=True,
2025-05-07T20:33:37.1808921Z )
2025-05-07T20:33:37.1809137Z self = 
2025-05-07T20:33:37.1809310Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:37.1809353Z 
2025-05-07T20:33:37.1809429Z     @given(
2025-05-07T20:33:37.1809543Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:37.1809640Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:37.1809756Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:37.1809866Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:37.1809974Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:37.1810051Z     )
2025-05-07T20:33:37.1810298Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:37.1810401Z     def test_silu_mul_quant(
2025-05-07T20:33:37.1810490Z         self,
2025-05-07T20:33:37.1810575Z         T: int,
2025-05-07T20:33:37.1810701Z         D: int,
2025-05-07T20:33:37.1810797Z         scale_ub: Optional[float],
2025-05-07T20:33:37.1810881Z         contiguous: bool,
2025-05-07T20:33:37.1810963Z         compiled: bool,
2025-05-07T20:33:37.1811037Z     ) -> None:
2025-05-07T20:33:37.1811130Z         torch.manual_seed(2025)
2025-05-07T20:33:37.1811202Z 
2025-05-07T20:33:37.1811363Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:37.1811432Z 
2025-05-07T20:33:37.1811522Z         x_sign = torch.sign(x)
2025-05-07T20:33:37.1811641Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:37.1811724Z         x = x_sign * x_clamp
2025-05-07T20:33:37.1811802Z         x0 = x[:, :D]
2025-05-07T20:33:37.1811917Z         x1 = x[:, D:]
2025-05-07T20:33:37.1811985Z 
2025-05-07T20:33:37.1812068Z         if contiguous:
2025-05-07T20:33:37.1812153Z             x0 = x0.contiguous()
2025-05-07T20:33:37.1812240Z             x1 = x1.contiguous()
2025-05-07T20:33:37.1812313Z 
2025-05-07T20:33:37.1812399Z         if scale_ub is not None:
2025-05-07T20:33:37.1812507Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:37.1812637Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:37.1812711Z             )
2025-05-07T20:33:37.1812793Z         else:
2025-05-07T20:33:37.1812883Z             scale_ub_tensor = None
2025-05-07T20:33:37.1812953Z 
2025-05-07T20:33:37.1813086Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:37.1813174Z             op = silu_mul_quant
2025-05-07T20:33:37.1813258Z             if compiled:
2025-05-07T20:33:37.1813359Z                 op = torch.compile(op)
2025-05-07T20:33:37.1813462Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:37.1813542Z 
2025-05-07T20:33:37.1813632Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:37.1813636Z 
2025-05-07T20:33:37.1813734Z moe/activation_test.py:117: 
2025-05-07T20:33:37.1813867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:37.1813966Z moe/activation_test.py:115: in fn
2025-05-07T20:33:37.1814062Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:37.1814425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:37.1814517Z     return fn(*args, **kwargs)
2025-05-07T20:33:37.1815001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:37.1815094Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:37.1815446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:37.1815669Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:37.1816055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:37.1816147Z     kernel = self.compile(
2025-05-07T20:33:37.1816547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:37.1816714Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:37.1816880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:37.1816884Z 
2025-05-07T20:33:37.1817081Z self = 
2025-05-07T20:33:37.1817846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:37.1818355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca187380>}
2025-05-07T20:33:37.1819124Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:37.1819316Z context = 
2025-05-07T20:33:37.1819323Z 
2025-05-07T20:33:37.1819482Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:37.1819737Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:37.1819844Z                            module_map=module_map)
2025-05-07T20:33:37.1819999Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:37.1820159Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:37.1820237Z E       ^
2025-05-07T20:33:37.1820589Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:37.1820593Z 
2025-05-07T20:33:37.1821038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
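
Why these examples fail: the _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv element type (NVIDIA's FP8 E4M3 encoding), and Triton can only lower fp8e4nv on GPUs with compute capability 8.9 or newer (Ada/Hopper class). The ~22 GiB device reported in the out-of-memory message at the end of this run is consistent with an A10G, which is compute capability 8.6, so only the fp8e4b15 and fp8e5 encodings named in the error are available there. A minimal sketch of a capability guard that would skip these cases on such a runner (the helper name and skip message are illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def _supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv (E4M3) codegen requires SM 8.9+ on CUDA devices.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8_e4m3(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        ...
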
Hypothesis went on to try eleven more examples, and every one failed with the identical CompilationError out of the _fbgemm_silu_mul_quant compile (triton/compiler/compiler.py:100: "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised through the same call chain from moe/activation_test.py:117 (for the compiled=False examples, without the torch/_dynamo/eval_frame.py frame). Only the drawn parameters differ:

    T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
    T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
    T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1967346Z 2025-05-07T20:33:37.1967773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1967777Z 2025-05-07T20:33:37.1967875Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1968134Z self=, 2025-05-07T20:33:37.1968211Z T=16384, 2025-05-07T20:33:37.1968287Z D=5120, 2025-05-07T20:33:37.1968367Z scale_ub=None, 2025-05-07T20:33:37.1968452Z contiguous=False, 2025-05-07T20:33:37.1968533Z compiled=False, 2025-05-07T20:33:37.1968602Z ) 2025-05-07T20:33:37.1968816Z self = 2025-05-07T20:33:37.1968988Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1968996Z 2025-05-07T20:33:37.1969064Z @given( 2025-05-07T20:33:37.1969179Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1969269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1969377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1969490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1969595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1969667Z ) 2025-05-07T20:33:37.1969904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1969993Z def test_silu_mul_quant( 2025-05-07T20:33:37.1970067Z self, 2025-05-07T20:33:37.1970143Z T: int, 2025-05-07T20:33:37.1970213Z D: int, 2025-05-07T20:33:37.1970307Z scale_ub: Optional[float], 2025-05-07T20:33:37.1970412Z contiguous: bool, 2025-05-07T20:33:37.1970495Z compiled: bool, 2025-05-07T20:33:37.1970587Z ) -> None: 2025-05-07T20:33:37.1970682Z torch.manual_seed(2025) 2025-05-07T20:33:37.1970749Z 2025-05-07T20:33:37.1970912Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1970978Z 2025-05-07T20:33:37.1971066Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1971185Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1973006Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
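The CompilationError repeated throughout this log is an architecture limit rather than a kernel bug: Triton's fp8e4nv dtype (PyTorch's float8_e4m3fn) is only supported on compute capability 8.9 and newer (Ada, Hopper), while the A10G in a g5.4xlarge is sm_86, where Triton offers only fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip these examples on older GPUs; the helper name and skip wiring are illustrative, not FBGEMM's actual code:

# Hypothetical guard, not part of activation_test.py: skip fp8e4nv
# (float8_e4m3fn) kernels on GPUs older than sm_89.
import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv needs compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+; A10G is sm_86")
class Fp8ActivationTests(unittest.TestCase):
    ...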
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1973020Z 2025-05-07T20:33:37.1973171Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.1973176Z 2025-05-07T20:33:37.1973269Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1973489Z self=, 2025-05-07T20:33:37.1973559Z T=4096, 2025-05-07T20:33:37.1973625Z D=7168, 2025-05-07T20:33:37.1973708Z scale_ub=1200.0, 2025-05-07T20:33:37.1973784Z contiguous=True, 2025-05-07T20:33:37.1973862Z compiled=True, 2025-05-07T20:33:37.1973932Z ) 2025-05-07T20:33:37.1974141Z self = 2025-05-07T20:33:37.1974316Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1974383Z 2025-05-07T20:33:37.1974458Z @given( 2025-05-07T20:33:37.1974568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1974660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1974765Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1974876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1974984Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1975055Z ) 2025-05-07T20:33:37.1975297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1975383Z def test_silu_mul_quant( 2025-05-07T20:33:37.1975454Z self, 2025-05-07T20:33:37.1975532Z T: int, 2025-05-07T20:33:37.1975647Z D: int, 2025-05-07T20:33:37.1975738Z scale_ub: Optional[float], 2025-05-07T20:33:37.1975825Z contiguous: bool, 2025-05-07T20:33:37.1975911Z compiled: bool, 2025-05-07T20:33:37.1975983Z ) -> None: 2025-05-07T20:33:37.1976075Z torch.manual_seed(2025) 2025-05-07T20:33:37.1976143Z 2025-05-07T20:33:37.1976302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1976373Z 2025-05-07T20:33:37.1976457Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1976579Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1978335Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
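The allocation sizes in these OutOfMemoryError messages line up exactly with the test's [T, 2*D] bfloat16 temporaries at 2 bytes per element: T=4096, D=7168 gives 4096 * 14336 * 2 B = 112 MiB, and T=16384 gives 448 MiB (D=7168) or 320 MiB (D=5120). A quick check of that arithmetic:

# Size of one [T, 2*D] bfloat16 temporary, in MiB.
def bf16_mib(T: int, D: int) -> float:
    return T * 2 * D * 2 / 2**20

print(bf16_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
print(bf16_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
print(bf16_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"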
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1978344Z 2025-05-07T20:33:37.1978459Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.1978463Z 2025-05-07T20:33:37.1978557Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1978769Z self=, 2025-05-07T20:33:37.1978846Z T=16384, 2025-05-07T20:33:37.1978919Z D=7168, 2025-05-07T20:33:37.1978994Z scale_ub=None, 2025-05-07T20:33:37.1979074Z contiguous=False, 2025-05-07T20:33:37.1979152Z compiled=False, 2025-05-07T20:33:37.1979216Z ) 2025-05-07T20:33:37.1979425Z self = 2025-05-07T20:33:37.1979591Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1979598Z 2025-05-07T20:33:37.1979673Z @given( 2025-05-07T20:33:37.1979782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1979919Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1980030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1980142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1980246Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1980318Z ) 2025-05-07T20:33:37.1980553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1980678Z def test_silu_mul_quant( 2025-05-07T20:33:37.1980753Z self, 2025-05-07T20:33:37.1980825Z T: int, 2025-05-07T20:33:37.1980896Z D: int, 2025-05-07T20:33:37.1980989Z scale_ub: Optional[float], 2025-05-07T20:33:37.1981071Z contiguous: bool, 2025-05-07T20:33:37.1981152Z compiled: bool, 2025-05-07T20:33:37.1981223Z ) -> None: 2025-05-07T20:33:37.1981311Z torch.manual_seed(2025) 2025-05-07T20:33:37.1981377Z 2025-05-07T20:33:37.1981536Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1983330Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
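The allocator hint in the message, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only helps when fragmentation rather than total footprint is the problem, and it has to be in the environment before the process makes its first CUDA allocation. A sketch of applying it from Python; exporting it in the CI job environment would be equivalent:

# Must be set before the first CUDA allocation in the process, or the
# caching allocator will already have been configured without it.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # safe here: the allocator reads the variable at first CUDA use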
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1983342Z 2025-05-07T20:33:37.1983453Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.1983457Z 2025-05-07T20:33:37.1983551Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1983770Z self=, 2025-05-07T20:33:37.1983878Z T=2048, 2025-05-07T20:33:37.1983946Z D=7168, 2025-05-07T20:33:37.1984025Z scale_ub=1200.0, 2025-05-07T20:33:37.1984105Z contiguous=True, 2025-05-07T20:33:37.1984185Z compiled=True, 2025-05-07T20:33:37.1984252Z ) 2025-05-07T20:33:37.1984460Z self = 2025-05-07T20:33:37.1984625Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1984629Z 2025-05-07T20:33:37.1984701Z @given( 2025-05-07T20:33:37.1984809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1984902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1985007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1985114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1985221Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1985290Z ) 2025-05-07T20:33:37.1985527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1985612Z def test_silu_mul_quant( 2025-05-07T20:33:37.1985684Z self, 2025-05-07T20:33:37.1985754Z T: int, 2025-05-07T20:33:37.1985826Z D: int, 2025-05-07T20:33:37.1985916Z scale_ub: Optional[float], 2025-05-07T20:33:37.1986001Z contiguous: bool, 2025-05-07T20:33:37.1986083Z compiled: bool, 2025-05-07T20:33:37.1986156Z ) -> None: 2025-05-07T20:33:37.1986245Z torch.manual_seed(2025) 2025-05-07T20:33:37.1986313Z 2025-05-07T20:33:37.1986471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1986541Z 2025-05-07T20:33:37.1986624Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1986743Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1988580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
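Because Hypothesis replays many examples in one long-lived process, memory cached by earlier examples (the messages show 21.6 GiB and more already held by PyTorch) starves later ones. One mitigation, not present in the test above, is to drop dead references and return cached blocks between examples:

# Hypothetical per-example cleanup; not part of activation_test.py.
import gc

import torch

def free_cuda_cache() -> None:
    gc.collect()                          # release dead tensor references
    torch.cuda.empty_cache()              # hand cached blocks back to the driver
    torch.cuda.reset_peak_memory_stats()  # keep peak stats per-example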
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1988591Z 2025-05-07T20:33:37.1988706Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.1988749Z 2025-05-07T20:33:37.1988843Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1989054Z self=, 2025-05-07T20:33:37.1989124Z T=2048, 2025-05-07T20:33:37.1989192Z D=7168, 2025-05-07T20:33:37.1989266Z scale_ub=None, 2025-05-07T20:33:37.1989345Z contiguous=True, 2025-05-07T20:33:37.1989421Z compiled=False, 2025-05-07T20:33:37.1989487Z ) 2025-05-07T20:33:37.1989696Z self = 2025-05-07T20:33:37.1989862Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.1989867Z 2025-05-07T20:33:37.1989977Z @given( 2025-05-07T20:33:37.1990088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1990179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1990287Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1990399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1990503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1990574Z ) 2025-05-07T20:33:37.1990806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1990896Z def test_silu_mul_quant( 2025-05-07T20:33:37.1990966Z self, 2025-05-07T20:33:37.1991035Z T: int, 2025-05-07T20:33:37.1991106Z D: int, 2025-05-07T20:33:37.1991238Z scale_ub: Optional[float], 2025-05-07T20:33:37.1991320Z contiguous: bool, 2025-05-07T20:33:37.1991401Z compiled: bool, 2025-05-07T20:33:37.1991473Z ) -> None: 2025-05-07T20:33:37.1991562Z torch.manual_seed(2025) 2025-05-07T20:33:37.1991631Z 2025-05-07T20:33:37.1991788Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1991852Z 2025-05-07T20:33:37.1991940Z > x_sign = torch.sign(x) 2025-05-07T20:33:37.1993672Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
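Stripped of the Hypothesis harness, the failing call reduces to a few lines. This repro assumes the module path shown in the traceback (fbgemm_gpu/experimental/gen_ai/moe/activation.py) is importable as written, and it uses the (x0, x1, scale_ub) call shape visible in the test source:

# Minimal repro sketch: on sm_86 this raises the CompilationError above;
# on sm_89+ it should return an fp8 tensor and its scale.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
y_fp8, y_scale = silu_mul_quant(x0, x1, None)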
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1993683Z 2025-05-07T20:33:37.1993797Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:37.1993802Z 2025-05-07T20:33:37.1993897Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1994111Z self=, 2025-05-07T20:33:37.1994181Z T=1, 2025-05-07T20:33:37.1994249Z D=7168, 2025-05-07T20:33:37.1994326Z scale_ub=1200.0, 2025-05-07T20:33:37.1994404Z contiguous=True, 2025-05-07T20:33:37.1994477Z compiled=False, 2025-05-07T20:33:37.1994543Z ) 2025-05-07T20:33:37.1994751Z self = 2025-05-07T20:33:37.1994907Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.1994912Z 2025-05-07T20:33:37.1994983Z @given( 2025-05-07T20:33:37.1995090Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1995187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1995292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1995444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1995555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1995624Z ) 2025-05-07T20:33:37.1995857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1995946Z def test_silu_mul_quant( 2025-05-07T20:33:37.1996018Z self, 2025-05-07T20:33:37.1996124Z T: int, 2025-05-07T20:33:37.1996196Z D: int, 2025-05-07T20:33:37.1996284Z scale_ub: Optional[float], 2025-05-07T20:33:37.1996365Z contiguous: bool, 2025-05-07T20:33:37.1996445Z compiled: bool, 2025-05-07T20:33:37.1996515Z ) -> None: 2025-05-07T20:33:37.1996605Z torch.manual_seed(2025) 2025-05-07T20:33:37.1996671Z 2025-05-07T20:33:37.1996827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1996898Z 2025-05-07T20:33:37.1996982Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1997101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1997185Z x = x_sign * x_clamp 2025-05-07T20:33:37.1997296Z x0 = x[:, :D] 2025-05-07T20:33:37.1997369Z x1 = x[:, D:] 2025-05-07T20:33:37.1997439Z 2025-05-07T20:33:37.1997515Z if contiguous: 2025-05-07T20:33:37.1997599Z x0 = x0.contiguous() 2025-05-07T20:33:37.1997684Z x1 = x1.contiguous() 2025-05-07T20:33:37.1997753Z 2025-05-07T20:33:37.1997835Z if scale_ub is not None: 2025-05-07T20:33:37.1997936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1998062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1998131Z ) 2025-05-07T20:33:37.1998202Z else: 2025-05-07T20:33:37.1998289Z scale_ub_tensor = None 2025-05-07T20:33:37.1998396Z 2025-05-07T20:33:37.1998520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1998602Z op = silu_mul_quant 2025-05-07T20:33:37.1998684Z if compiled: 2025-05-07T20:33:37.1998776Z op = torch.compile(op) 2025-05-07T20:33:37.1998876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1998946Z 2025-05-07T20:33:37.1999029Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1999034Z 2025-05-07T20:33:37.1999125Z moe/activation_test.py:117: 2025-05-07T20:33:37.1999248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1999341Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1999438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1999933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2000025Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2000411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2000656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2000995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2001082Z kernel = self.compile( 2025-05-07T20:33:37.2001476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2001654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2001773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2001778Z 2025-05-07T20:33:37.2001970Z self = 2025-05-07T20:33:37.2002734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2003271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199aa2a0>} 2025-05-07T20:33:37.2004007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2004251Z context = 2025-05-07T20:33:37.2004256Z 2025-05-07T20:33:37.2004414Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2004669Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2004775Z module_map=module_map) 2025-05-07T20:33:37.2004936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2005030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2005102Z E ^ 2025-05-07T20:33:37.2005493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2005498Z 2025-05-07T20:33:37.2005912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2005919Z 2025-05-07T20:33:37.2006019Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2006234Z self=, 2025-05-07T20:33:37.2006306Z T=128, 2025-05-07T20:33:37.2006385Z D=5120, 2025-05-07T20:33:37.2006462Z scale_ub=None, 2025-05-07T20:33:37.2006541Z contiguous=True, 2025-05-07T20:33:37.2006621Z compiled=False, 2025-05-07T20:33:37.2006688Z ) 2025-05-07T20:33:37.2006941Z self = 2025-05-07T20:33:37.2007106Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2007114Z 2025-05-07T20:33:37.2007182Z @given( 2025-05-07T20:33:37.2007300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2007395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2007505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2007618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2007727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2007796Z ) 2025-05-07T20:33:37.2008031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2008121Z def test_silu_mul_quant( 2025-05-07T20:33:37.2008192Z self, 2025-05-07T20:33:37.2008263Z T: int, 2025-05-07T20:33:37.2008334Z D: int, 2025-05-07T20:33:37.2008434Z scale_ub: Optional[float], 2025-05-07T20:33:37.2008518Z contiguous: bool, 2025-05-07T20:33:37.2008600Z compiled: bool, 2025-05-07T20:33:37.2008678Z ) -> None: 2025-05-07T20:33:37.2008771Z torch.manual_seed(2025) 2025-05-07T20:33:37.2008841Z 2025-05-07T20:33:37.2009009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2009080Z 2025-05-07T20:33:37.2009166Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2009288Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2009373Z x = x_sign * x_clamp 2025-05-07T20:33:37.2009451Z x0 = x[:, :D] 2025-05-07T20:33:37.2009527Z x1 = x[:, D:] 2025-05-07T20:33:37.2009595Z 2025-05-07T20:33:37.2009675Z if contiguous: 2025-05-07T20:33:37.2009759Z x0 = x0.contiguous() 2025-05-07T20:33:37.2009841Z x1 = x1.contiguous() 2025-05-07T20:33:37.2009911Z 2025-05-07T20:33:37.2009993Z if scale_ub is not None: 2025-05-07T20:33:37.2010095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2010225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2010343Z ) 2025-05-07T20:33:37.2010416Z else: 2025-05-07T20:33:37.2010514Z scale_ub_tensor = None 2025-05-07T20:33:37.2010579Z 2025-05-07T20:33:37.2010707Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2010791Z op = silu_mul_quant 2025-05-07T20:33:37.2010873Z if compiled: 2025-05-07T20:33:37.2011015Z op = torch.compile(op) 2025-05-07T20:33:37.2011118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2011187Z 2025-05-07T20:33:37.2011276Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2011281Z 2025-05-07T20:33:37.2011372Z moe/activation_test.py:117: 2025-05-07T20:33:37.2011495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2011593Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2011688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2012186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2012316Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2012669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2012888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2013228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2013316Z kernel = self.compile( 2025-05-07T20:33:37.2013713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2013881Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2014044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2014049Z 2025-05-07T20:33:37.2014247Z self = 2025-05-07T20:33:37.2015009Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2015501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199ab1a0>} 2025-05-07T20:33:37.2016237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2016421Z context = 2025-05-07T20:33:37.2016432Z 2025-05-07T20:33:37.2016589Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2016851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2016951Z module_map=module_map) 2025-05-07T20:33:37.2017107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2017201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2017270Z E ^ 2025-05-07T20:33:37.2017619Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2017623Z 2025-05-07T20:33:37.2018042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2018047Z 2025-05-07T20:33:37.2018144Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2018369Z self=, 2025-05-07T20:33:37.2018446Z T=128, 2025-05-07T20:33:37.2018517Z D=7168, 2025-05-07T20:33:37.2018642Z scale_ub=None, 2025-05-07T20:33:37.2018725Z contiguous=True, 2025-05-07T20:33:37.2018809Z compiled=False, 2025-05-07T20:33:37.2018879Z ) 2025-05-07T20:33:37.2019090Z self = 2025-05-07T20:33:37.2019254Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2019325Z 2025-05-07T20:33:37.2019399Z @given( 2025-05-07T20:33:37.2019511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2019607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2019716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2019827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2019942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2020015Z ) 2025-05-07T20:33:37.2020251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2020345Z def test_silu_mul_quant( 2025-05-07T20:33:37.2020419Z self, 2025-05-07T20:33:37.2020491Z T: int, 2025-05-07T20:33:37.2020632Z D: int, 2025-05-07T20:33:37.2020733Z scale_ub: Optional[float], 2025-05-07T20:33:37.2020838Z contiguous: bool, 2025-05-07T20:33:37.2020916Z compiled: bool, 2025-05-07T20:33:37.2020988Z ) -> None: 2025-05-07T20:33:37.2021086Z torch.manual_seed(2025) 2025-05-07T20:33:37.2021156Z 2025-05-07T20:33:37.2021316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2021391Z 2025-05-07T20:33:37.2021476Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2021593Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2021677Z x = x_sign * x_clamp 2025-05-07T20:33:37.2021749Z x0 = x[:, :D] 2025-05-07T20:33:37.2021862Z x1 = x[:, D:] 2025-05-07T20:33:37.2021931Z 2025-05-07T20:33:37.2022006Z if contiguous: 2025-05-07T20:33:37.2022088Z x0 = x0.contiguous() 2025-05-07T20:33:37.2022175Z x1 = x1.contiguous() 2025-05-07T20:33:37.2022244Z 2025-05-07T20:33:37.2022332Z if scale_ub is not None: 2025-05-07T20:33:37.2022431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2022557Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2022626Z ) 2025-05-07T20:33:37.2022702Z else: 2025-05-07T20:33:37.2022789Z scale_ub_tensor = None 2025-05-07T20:33:37.2022864Z 2025-05-07T20:33:37.2022987Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2023071Z op = silu_mul_quant 2025-05-07T20:33:37.2023157Z if compiled: 2025-05-07T20:33:37.2023251Z op = torch.compile(op) 2025-05-07T20:33:37.2023349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2023426Z 2025-05-07T20:33:37.2023512Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2023516Z 2025-05-07T20:33:37.2023614Z moe/activation_test.py:117: 2025-05-07T20:33:37.2023738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2023829Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2023922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2024408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2024500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2024855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2025068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2025402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2025492Z kernel = self.compile( 2025-05-07T20:33:37.2025930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2026101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2026219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2026224Z 2025-05-07T20:33:37.2026421Z self = 2025-05-07T20:33:37.2027219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2027752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b78040>} 2025-05-07T20:33:37.2028491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2028711Z context = 2025-05-07T20:33:37.2028716Z 2025-05-07T20:33:37.2028876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2029128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2029232Z module_map=module_map) 2025-05-07T20:33:37.2029389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2029478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2029552Z E ^ 2025-05-07T20:33:37.2029897Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2029942Z 2025-05-07T20:33:37.2030355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2030360Z 2025-05-07T20:33:37.2030471Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2030724Z self=, 2025-05-07T20:33:37.2030798Z T=2048, 2025-05-07T20:33:37.2030873Z D=7168, 2025-05-07T20:33:37.2030950Z scale_ub=1200.0, 2025-05-07T20:33:37.2031035Z contiguous=True, 2025-05-07T20:33:37.2031115Z compiled=False, 2025-05-07T20:33:37.2031181Z ) 2025-05-07T20:33:37.2031395Z self = 2025-05-07T20:33:37.2031560Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2031564Z 2025-05-07T20:33:37.2031636Z @given( 2025-05-07T20:33:37.2031751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2031847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2031958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2032071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2032181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2032250Z ) 2025-05-07T20:33:37.2032487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2032573Z def test_silu_mul_quant( 2025-05-07T20:33:37.2032647Z self, 2025-05-07T20:33:37.2032721Z T: int, 2025-05-07T20:33:37.2032791Z D: int, 2025-05-07T20:33:37.2032884Z scale_ub: Optional[float], 2025-05-07T20:33:37.2032968Z contiguous: bool, 2025-05-07T20:33:37.2033046Z compiled: bool, 2025-05-07T20:33:37.2033120Z ) -> None: 2025-05-07T20:33:37.2033209Z torch.manual_seed(2025) 2025-05-07T20:33:37.2033277Z 2025-05-07T20:33:37.2033440Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2035264Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
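On a 22 GiB device the largest sampled shapes cannot coexist with what the process already holds, and Hypothesis can skip such draws instead of failing them. A sketch using hypothesis.assume with a hypothetical budget; the bound, the factor of six, and the test name are illustrative rather than taken from the suite:

from hypothesis import assume, given, strategies as st

_BUDGET_BYTES = 2 * 2**30  # assumed per-example budget: 2 GiB

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
def test_silu_mul_quant_bounded(T: int, D: int) -> None:
    # The original test keeps roughly six [T, 2*D]-sized bf16 buffers alive.
    assume(6 * T * 2 * D * 2 <= _BUDGET_BYTES)
    ...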
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2035312Z 2025-05-07T20:33:37.2035423Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2035428Z 2025-05-07T20:33:37.2035522Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2035740Z self=, 2025-05-07T20:33:37.2035810Z T=1, 2025-05-07T20:33:37.2035883Z D=5120, 2025-05-07T20:33:37.2035963Z scale_ub=1200.0, 2025-05-07T20:33:37.2036040Z contiguous=True, 2025-05-07T20:33:37.2036117Z compiled=False, 2025-05-07T20:33:37.2036192Z ) 2025-05-07T20:33:37.2036439Z self = 2025-05-07T20:33:37.2036602Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2036610Z 2025-05-07T20:33:37.2036681Z @given( 2025-05-07T20:33:37.2036790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2036888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2036996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2037107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2037218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2037286Z ) 2025-05-07T20:33:37.2037523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2037654Z def test_silu_mul_quant( 2025-05-07T20:33:37.2037726Z self, 2025-05-07T20:33:37.2037793Z T: int, 2025-05-07T20:33:37.2037869Z D: int, 2025-05-07T20:33:37.2037963Z scale_ub: Optional[float], 2025-05-07T20:33:37.2038051Z contiguous: bool, 2025-05-07T20:33:37.2038129Z compiled: bool, 2025-05-07T20:33:37.2038200Z ) -> None: 2025-05-07T20:33:37.2038292Z torch.manual_seed(2025) 2025-05-07T20:33:37.2038359Z 2025-05-07T20:33:37.2038517Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2038590Z 2025-05-07T20:33:37.2038677Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2038795Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2038880Z x = x_sign * x_clamp 2025-05-07T20:33:37.2038955Z x0 = x[:, :D] 2025-05-07T20:33:37.2039029Z x1 = x[:, D:] 2025-05-07T20:33:37.2039098Z 2025-05-07T20:33:37.2039175Z if contiguous: 2025-05-07T20:33:37.2039265Z x0 = x0.contiguous() 2025-05-07T20:33:37.2039347Z x1 = x1.contiguous() 2025-05-07T20:33:37.2039412Z 2025-05-07T20:33:37.2039499Z if scale_ub is not None: 2025-05-07T20:33:37.2039601Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2039728Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2039802Z ) 2025-05-07T20:33:37.2039871Z else: 2025-05-07T20:33:37.2039961Z scale_ub_tensor = None 2025-05-07T20:33:37.2040029Z 2025-05-07T20:33:37.2040536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2040668Z op = silu_mul_quant 2025-05-07T20:33:37.2040751Z if compiled: 2025-05-07T20:33:37.2040844Z op = torch.compile(op) 2025-05-07T20:33:37.2040945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2041012Z 2025-05-07T20:33:37.2041098Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2041107Z 2025-05-07T20:33:37.2041199Z moe/activation_test.py:117: 2025-05-07T20:33:37.2041320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2041502Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2041599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2042085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2042177Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2042590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2042806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2043148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2043237Z kernel = self.compile( 2025-05-07T20:33:37.2043633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2043816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2043992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2043997Z 2025-05-07T20:33:37.2044199Z self = 2025-05-07T20:33:37.2044959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2045451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b79580>} 2025-05-07T20:33:37.2046185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2046432Z context = 2025-05-07T20:33:37.2046436Z 2025-05-07T20:33:37.2046600Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2046855Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2046954Z module_map=module_map) 2025-05-07T20:33:37.2047118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2047212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2047289Z E ^ 2025-05-07T20:33:37.2047639Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2047643Z 2025-05-07T20:33:37.2048052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2048060Z 2025-05-07T20:33:37.2048159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2048375Z self=, 2025-05-07T20:33:37.2048455Z T=2048, 2025-05-07T20:33:37.2048525Z D=5120, 2025-05-07T20:33:37.2048600Z scale_ub=None, 2025-05-07T20:33:37.2048684Z contiguous=True, 2025-05-07T20:33:37.2048763Z compiled=False, 2025-05-07T20:33:37.2048830Z ) 2025-05-07T20:33:37.2049046Z self = 2025-05-07T20:33:37.2049211Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2049215Z 2025-05-07T20:33:37.2049286Z @given( 2025-05-07T20:33:37.2049402Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2049494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2049602Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2049718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2049823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2049941Z ) 2025-05-07T20:33:37.2050179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2050264Z def test_silu_mul_quant( 2025-05-07T20:33:37.2050340Z self, 2025-05-07T20:33:37.2050412Z T: int, 2025-05-07T20:33:37.2050480Z D: int, 2025-05-07T20:33:37.2050573Z scale_ub: Optional[float], 2025-05-07T20:33:37.2050696Z contiguous: bool, 2025-05-07T20:33:37.2050773Z compiled: bool, 2025-05-07T20:33:37.2050848Z ) -> None: 2025-05-07T20:33:37.2050935Z torch.manual_seed(2025) 2025-05-07T20:33:37.2051001Z 2025-05-07T20:33:37.2051163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2051231Z 2025-05-07T20:33:37.2051321Z > x_sign = torch.sign(x) 2025-05-07T20:33:37.2053113Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
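Several of the OOMs above fire before the kernel ever runs, in the test's preprocessing: torch.sign, torch.abs, torch.clamp, and the final multiply each materialize another full [T, 2*D] buffer on top of the randn input (hence failures at lines 92, 94, and 95). An in-place variant would keep a single extra buffer; this is an illustration, not the test's code:

import torch

def clamp_signed_inplace(x: torch.Tensor) -> torch.Tensor:
    sign = torch.sign(x)  # the one remaining temporary
    # abs_/clamp_/mul_ mutate x in place instead of allocating new tensors.
    return x.abs_().clamp_(0.01, 2.0).mul_(sign)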
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2053124Z 2025-05-07T20:33:37.2053239Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:37.2053244Z 2025-05-07T20:33:37.2053342Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2053557Z self=, 2025-05-07T20:33:37.2053634Z T=16384, 2025-05-07T20:33:37.2053705Z D=5120, 2025-05-07T20:33:37.2053818Z scale_ub=None, 2025-05-07T20:33:37.2053901Z contiguous=True, 2025-05-07T20:33:37.2053981Z compiled=False, 2025-05-07T20:33:37.2054051Z ) 2025-05-07T20:33:37.2054269Z self = 2025-05-07T20:33:37.2054440Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2054444Z 2025-05-07T20:33:37.2054522Z @given( 2025-05-07T20:33:37.2054635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2054728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2054849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2054961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2055070Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2055147Z ) 2025-05-07T20:33:37.2055386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2055475Z def test_silu_mul_quant( 2025-05-07T20:33:37.2055549Z self, 2025-05-07T20:33:37.2055620Z T: int, 2025-05-07T20:33:37.2055691Z D: int, 2025-05-07T20:33:37.2055784Z scale_ub: Optional[float], 2025-05-07T20:33:37.2055868Z contiguous: bool, 2025-05-07T20:33:37.2055951Z compiled: bool, 2025-05-07T20:33:37.2056023Z ) -> None: 2025-05-07T20:33:37.2056109Z torch.manual_seed(2025) 2025-05-07T20:33:37.2056178Z 2025-05-07T20:33:37.2056335Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2058093Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
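The "free" and "total capacity" figures in these messages come from the CUDA driver and can be queried directly, alongside PyTorch's own allocator counters, when deciding up front whether an example can fit:

import torch

free_b, total_b = torch.cuda.mem_get_info()    # driver-level view, in bytes
print(f"free={free_b / 2**20:.2f} MiB of {total_b / 2**30:.2f} GiB")
print(torch.cuda.memory_allocated() / 2**30)   # GiB held by live tensors
print(torch.cuda.memory_reserved() / 2**30)    # GiB cached by the allocator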
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2058101Z 2025-05-07T20:33:37.2058257Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2058262Z 2025-05-07T20:33:37.2058363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2058581Z self=, 2025-05-07T20:33:37.2058659Z T=4096, 2025-05-07T20:33:37.2058737Z D=5120, 2025-05-07T20:33:37.2058852Z scale_ub=None, 2025-05-07T20:33:37.2058930Z contiguous=True, 2025-05-07T20:33:37.2059011Z compiled=False, 2025-05-07T20:33:37.2059083Z ) 2025-05-07T20:33:37.2059291Z self = 2025-05-07T20:33:37.2059459Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2059463Z 2025-05-07T20:33:37.2059535Z @given( 2025-05-07T20:33:37.2059645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2059741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2059851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2063305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2063490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2063561Z ) 2025-05-07T20:33:37.2063806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2063893Z def test_silu_mul_quant( 2025-05-07T20:33:37.2063968Z self, 2025-05-07T20:33:37.2064039Z T: int, 2025-05-07T20:33:37.2064110Z D: int, 2025-05-07T20:33:37.2064202Z scale_ub: Optional[float], 2025-05-07T20:33:37.2064289Z contiguous: bool, 2025-05-07T20:33:37.2064367Z compiled: bool, 2025-05-07T20:33:37.2064441Z ) -> None: 2025-05-07T20:33:37.2064530Z torch.manual_seed(2025) 2025-05-07T20:33:37.2064596Z 2025-05-07T20:33:37.2064831Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2066586Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2066595Z 2025-05-07T20:33:37.2066709Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2066714Z 2025-05-07T20:33:37.2066811Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2067027Z self=, 2025-05-07T20:33:37.2067107Z T=2048, 2025-05-07T20:33:37.2067179Z D=5120, 2025-05-07T20:33:37.2067254Z scale_ub=None, 2025-05-07T20:33:37.2067340Z contiguous=False, 2025-05-07T20:33:37.2067491Z compiled=False, 2025-05-07T20:33:37.2067561Z ) 2025-05-07T20:33:37.2067777Z self = 2025-05-07T20:33:37.2067943Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.2067948Z 2025-05-07T20:33:37.2068023Z @given( 2025-05-07T20:33:37.2068135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2068227Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2068339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2068447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2068552Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2068628Z ) 2025-05-07T20:33:37.2068864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2068962Z def test_silu_mul_quant( 2025-05-07T20:33:37.2069034Z self, 2025-05-07T20:33:37.2069103Z T: int, 2025-05-07T20:33:37.2069221Z D: int, 2025-05-07T20:33:37.2069322Z scale_ub: Optional[float], 2025-05-07T20:33:37.2069411Z contiguous: bool, 2025-05-07T20:33:37.2069498Z compiled: bool, 2025-05-07T20:33:37.2069575Z ) -> None: 2025-05-07T20:33:37.2069670Z torch.manual_seed(2025) 2025-05-07T20:33:37.2069746Z 2025-05-07T20:33:37.2069948Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2071697Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2071705Z 2025-05-07T20:33:37.2071854Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2071859Z 2025-05-07T20:33:37.2071956Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2072176Z self=, 2025-05-07T20:33:37.2072250Z T=4096, 2025-05-07T20:33:37.2072333Z D=7168, 2025-05-07T20:33:37.2072413Z scale_ub=None, 2025-05-07T20:33:37.2072492Z contiguous=True, 2025-05-07T20:33:37.2072574Z compiled=True, 2025-05-07T20:33:37.2072644Z ) 2025-05-07T20:33:37.2072852Z self = 2025-05-07T20:33:37.2073016Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:37.2073065Z 2025-05-07T20:33:37.2073139Z @given( 2025-05-07T20:33:37.2073249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2073346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2073453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2073567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2073672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2073741Z ) 2025-05-07T20:33:37.2073979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2074069Z def test_silu_mul_quant( 2025-05-07T20:33:37.2074142Z self, 2025-05-07T20:33:37.2074221Z T: int, 2025-05-07T20:33:37.2074293Z D: int, 2025-05-07T20:33:37.2074382Z scale_ub: Optional[float], 2025-05-07T20:33:37.2074471Z contiguous: bool, 2025-05-07T20:33:37.2074551Z compiled: bool, 2025-05-07T20:33:37.2074624Z ) -> None: 2025-05-07T20:33:37.2074716Z torch.manual_seed(2025) 2025-05-07T20:33:37.2074789Z 2025-05-07T20:33:37.2074953Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2076696Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2076704Z 2025-05-07T20:33:37.2076816Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2076820Z 2025-05-07T20:33:37.2076914Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2077127Z self=, 2025-05-07T20:33:37.2077204Z T=2048, 2025-05-07T20:33:37.2077273Z D=5120, 2025-05-07T20:33:37.2077347Z scale_ub=1200.0, 2025-05-07T20:33:37.2077474Z contiguous=False, 2025-05-07T20:33:37.2077554Z compiled=False, 2025-05-07T20:33:37.2077624Z ) 2025-05-07T20:33:37.2077832Z self = 2025-05-07T20:33:37.2077998Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.2078003Z 2025-05-07T20:33:37.2078115Z @given( 2025-05-07T20:33:37.2078225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2078317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2078426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2078535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2078640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2078710Z ) 2025-05-07T20:33:37.2078947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2079037Z def test_silu_mul_quant( 2025-05-07T20:33:37.2079110Z self, 2025-05-07T20:33:37.2079183Z T: int, 2025-05-07T20:33:37.2079294Z D: int, 2025-05-07T20:33:37.2079389Z scale_ub: Optional[float], 2025-05-07T20:33:37.2079477Z contiguous: bool, 2025-05-07T20:33:37.2079561Z compiled: bool, 2025-05-07T20:33:37.2079634Z ) -> None: 2025-05-07T20:33:37.2079722Z torch.manual_seed(2025) 2025-05-07T20:33:37.2079797Z 2025-05-07T20:33:37.2079956Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2081698Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2081743Z 2025-05-07T20:33:37.2081854Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2081858Z 2025-05-07T20:33:37.2081953Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2082170Z self=, 2025-05-07T20:33:37.2082244Z T=4096, 2025-05-07T20:33:37.2082318Z D=7168, 2025-05-07T20:33:37.2082393Z scale_ub=1200.0, 2025-05-07T20:33:37.2082469Z contiguous=True, 2025-05-07T20:33:37.2082549Z compiled=False, 2025-05-07T20:33:37.2082618Z ) 2025-05-07T20:33:37.2082825Z self = 2025-05-07T20:33:37.2082993Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2083000Z 2025-05-07T20:33:37.2083073Z @given( 2025-05-07T20:33:37.2083188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2083283Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2083393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2083506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2083613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2083683Z ) 2025-05-07T20:33:37.2083923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2084011Z def test_silu_mul_quant( 2025-05-07T20:33:37.2084081Z self, 2025-05-07T20:33:37.2084157Z T: int, 2025-05-07T20:33:37.2084229Z D: int, 2025-05-07T20:33:37.2084320Z scale_ub: Optional[float], 2025-05-07T20:33:37.2084406Z contiguous: bool, 2025-05-07T20:33:37.2084486Z compiled: bool, 2025-05-07T20:33:37.2084558Z ) -> None: 2025-05-07T20:33:37.2084648Z torch.manual_seed(2025) 2025-05-07T20:33:37.2084713Z 2025-05-07T20:33:37.2084919Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2086665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2086707Z 2025-05-07T20:33:37.2086818Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2086822Z 2025-05-07T20:33:37.2086917Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2087132Z self=, 2025-05-07T20:33:37.2087206Z T=16384, 2025-05-07T20:33:37.2087277Z D=7168, 2025-05-07T20:33:37.2087352Z scale_ub=None, 2025-05-07T20:33:37.2087469Z contiguous=False, 2025-05-07T20:33:37.2087547Z compiled=True, 2025-05-07T20:33:37.2087612Z ) 2025-05-07T20:33:37.2087824Z self = 2025-05-07T20:33:37.2087992Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.2088000Z 2025-05-07T20:33:37.2088071Z @given( 2025-05-07T20:33:37.2088179Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2088272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2088383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2088491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2088638Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2088711Z ) 2025-05-07T20:33:37.2088947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2089035Z def test_silu_mul_quant( 2025-05-07T20:33:37.2089108Z self, 2025-05-07T20:33:37.2089177Z T: int, 2025-05-07T20:33:37.2089250Z D: int, 2025-05-07T20:33:37.2089338Z scale_ub: Optional[float], 2025-05-07T20:33:37.2089418Z contiguous: bool, 2025-05-07T20:33:37.2089498Z compiled: bool, 2025-05-07T20:33:37.2089574Z ) -> None: 2025-05-07T20:33:37.2089659Z torch.manual_seed(2025) 2025-05-07T20:33:37.2089728Z 2025-05-07T20:33:37.2089885Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2091681Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2091691Z 2025-05-07T20:33:37.2091799Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2091803Z 2025-05-07T20:33:37.2091897Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2092113Z self=, 2025-05-07T20:33:37.2092182Z T=4096, 2025-05-07T20:33:37.2092252Z D=7168, 2025-05-07T20:33:37.2092328Z scale_ub=None, 2025-05-07T20:33:37.2092404Z contiguous=True, 2025-05-07T20:33:37.2092483Z compiled=False, 2025-05-07T20:33:37.2092548Z ) 2025-05-07T20:33:37.2092754Z self = 2025-05-07T20:33:37.2092919Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2092924Z 2025-05-07T20:33:37.2093037Z @given( 2025-05-07T20:33:37.2093149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2093244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2093348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2093459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2093566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2093671Z ) 2025-05-07T20:33:37.2093909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2093994Z def test_silu_mul_quant( 2025-05-07T20:33:37.2094061Z self, 2025-05-07T20:33:37.2094135Z T: int, 2025-05-07T20:33:37.2094202Z D: int, 2025-05-07T20:33:37.2094292Z scale_ub: Optional[float], 2025-05-07T20:33:37.2094380Z contiguous: bool, 2025-05-07T20:33:37.2094459Z compiled: bool, 2025-05-07T20:33:37.2094528Z ) -> None: 2025-05-07T20:33:37.2094616Z torch.manual_seed(2025) 2025-05-07T20:33:37.2094686Z 2025-05-07T20:33:37.2094912Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2096646Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2096655Z 2025-05-07T20:33:37.2096766Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2096810Z 2025-05-07T20:33:37.2096905Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2097121Z self=, 2025-05-07T20:33:37.2097194Z T=16384, 2025-05-07T20:33:37.2097270Z D=7168, 2025-05-07T20:33:37.2097344Z scale_ub=None, 2025-05-07T20:33:37.2097424Z contiguous=True, 2025-05-07T20:33:37.2097500Z compiled=False, 2025-05-07T20:33:37.2097565Z ) 2025-05-07T20:33:37.2097774Z self = 2025-05-07T20:33:37.2097942Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2097947Z 2025-05-07T20:33:37.2098016Z @given( 2025-05-07T20:33:37.2098126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2098218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2098327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2098434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2098542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2098612Z ) 2025-05-07T20:33:37.2098849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2098940Z def test_silu_mul_quant( 2025-05-07T20:33:37.2099012Z self, 2025-05-07T20:33:37.2099082Z T: int, 2025-05-07T20:33:37.2099154Z D: int, 2025-05-07T20:33:37.2099242Z scale_ub: Optional[float], 2025-05-07T20:33:37.2099323Z contiguous: bool, 2025-05-07T20:33:37.2099407Z compiled: bool, 2025-05-07T20:33:37.2099479Z ) -> None: 2025-05-07T20:33:37.2099567Z torch.manual_seed(2025) 2025-05-07T20:33:37.2099636Z 2025-05-07T20:33:37.2099794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2101624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2101633Z 2025-05-07T20:33:37.2101745Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2101787Z 2025-05-07T20:33:37.2101881Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2102097Z self=, 2025-05-07T20:33:37.2102166Z T=16384, 2025-05-07T20:33:37.2102240Z D=7168, 2025-05-07T20:33:37.2102317Z scale_ub=1200.0, 2025-05-07T20:33:37.2102394Z contiguous=True, 2025-05-07T20:33:37.2102474Z compiled=False, 2025-05-07T20:33:37.2102544Z ) 2025-05-07T20:33:37.2102751Z self = 2025-05-07T20:33:37.2102922Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2102927Z 2025-05-07T20:33:37.2103033Z @given( 2025-05-07T20:33:37.2103144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2103238Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2103345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2103455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2103565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2103632Z ) 2025-05-07T20:33:37.2103868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2103954Z def test_silu_mul_quant( 2025-05-07T20:33:37.2104023Z self, 2025-05-07T20:33:37.2104095Z T: int, 2025-05-07T20:33:37.2104163Z D: int, 2025-05-07T20:33:37.2104293Z scale_ub: Optional[float], 2025-05-07T20:33:37.2104378Z contiguous: bool, 2025-05-07T20:33:37.2104457Z compiled: bool, 2025-05-07T20:33:37.2104528Z ) -> None: 2025-05-07T20:33:37.2104618Z torch.manual_seed(2025) 2025-05-07T20:33:37.2104686Z 2025-05-07T20:33:37.2104845Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2106580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
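Each of the examples above fails at the same allocation site (moe/activation_test.py:92), and the requested sizes match the test input exactly: T x 2D bf16 elements, i.e. 4096 * 14336 * 2 bytes = 112 MiB and 16384 * 14336 * 2 bytes = 448 MiB. The requests fail not because they are large but because earlier examples have left roughly 22 GiB of the A10G's 22.07 GiB allocated, leaving only ~26 MiB free. Below is a minimal sketch of the mitigation the error text itself suggests, plus an explicit cache release that could run between Hypothesis examples; the helper name is hypothetical and not part of activation_test.py.

import gc
import os

# Must be set before the CUDA context is created (i.e. before the first
# CUDA call), per the PyTorch memory-management docs linked above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cached_gpu_memory() -> None:
    # Hypothetical teardown helper: drop dead Python references first,
    # then return the allocator's cached blocks to the driver so the next
    # Hypothesis example starts from a clean pool.
    gc.collect()
    torch.cuda.empty_cache()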
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2106590Z 2025-05-07T20:33:37.2106702Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2106706Z 2025-05-07T20:33:37.2106799Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2107016Z self=, 2025-05-07T20:33:37.2107089Z T=128, 2025-05-07T20:33:37.2107158Z D=5120, 2025-05-07T20:33:37.2107232Z scale_ub=1200.0, 2025-05-07T20:33:37.2107311Z contiguous=False, 2025-05-07T20:33:37.2107388Z compiled=False, 2025-05-07T20:33:37.2107499Z ) 2025-05-07T20:33:37.2107709Z self = 2025-05-07T20:33:37.2107872Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.2107876Z 2025-05-07T20:33:37.2107947Z @given( 2025-05-07T20:33:37.2108056Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2108148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2108260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2108369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2108517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2108590Z ) 2025-05-07T20:33:37.2108826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2108916Z def test_silu_mul_quant( 2025-05-07T20:33:37.2108987Z self, 2025-05-07T20:33:37.2109058Z T: int, 2025-05-07T20:33:37.2109131Z D: int, 2025-05-07T20:33:37.2109261Z scale_ub: Optional[float], 2025-05-07T20:33:37.2109344Z contiguous: bool, 2025-05-07T20:33:37.2109423Z compiled: bool, 2025-05-07T20:33:37.2109493Z ) -> None: 2025-05-07T20:33:37.2109578Z torch.manual_seed(2025) 2025-05-07T20:33:37.2109649Z 2025-05-07T20:33:37.2109808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2109872Z 2025-05-07T20:33:37.2109960Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2110077Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2110159Z x = x_sign * x_clamp 2025-05-07T20:33:37.2110238Z x0 = x[:, :D] 2025-05-07T20:33:37.2110311Z x1 = x[:, D:] 2025-05-07T20:33:37.2110418Z 2025-05-07T20:33:37.2110495Z if contiguous: 2025-05-07T20:33:37.2110578Z x0 = x0.contiguous() 2025-05-07T20:33:37.2110661Z x1 = x1.contiguous() 2025-05-07T20:33:37.2110726Z 2025-05-07T20:33:37.2110807Z if scale_ub is not None: 2025-05-07T20:33:37.2110910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2111037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2111104Z ) 2025-05-07T20:33:37.2111176Z else: 2025-05-07T20:33:37.2111261Z scale_ub_tensor = None 2025-05-07T20:33:37.2111324Z 2025-05-07T20:33:37.2111448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2111574Z op = silu_mul_quant 2025-05-07T20:33:37.2111656Z if compiled: 2025-05-07T20:33:37.2111748Z op = torch.compile(op) 2025-05-07T20:33:37.2111849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2111915Z 2025-05-07T20:33:37.2112003Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2112008Z 2025-05-07T20:33:37.2112096Z moe/activation_test.py:117: 2025-05-07T20:33:37.2112220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2112316Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2112406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2112902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2112990Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2113347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2113565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2113902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2113995Z kernel = self.compile( 2025-05-07T20:33:37.2114390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2114560Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2114682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2114687Z 2025-05-07T20:33:37.2114880Z self = 2025-05-07T20:33:37.2115645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2116179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819876e80>} 2025-05-07T20:33:37.2116919Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2117101Z context = 2025-05-07T20:33:37.2117143Z 2025-05-07T20:33:37.2117299Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2117566Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2117668Z module_map=module_map) 2025-05-07T20:33:37.2117825Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2117920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2117993Z E ^ 2025-05-07T20:33:37.2118344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2118386Z 2025-05-07T20:33:37.2118797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2118801Z 2025-05-07T20:33:37.2118899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2119115Z self=, 2025-05-07T20:33:37.2119186Z T=2048, 2025-05-07T20:33:37.2119260Z D=7168, 2025-05-07T20:33:37.2119336Z scale_ub=None, 2025-05-07T20:33:37.2119417Z contiguous=False, 2025-05-07T20:33:37.2119495Z compiled=False, 2025-05-07T20:33:37.2119561Z ) 2025-05-07T20:33:37.2119770Z self = 2025-05-07T20:33:37.2119979Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.2119984Z 2025-05-07T20:33:37.2120053Z @given( 2025-05-07T20:33:37.2120169Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2120263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2120371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2120482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2120587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2120658Z ) 2025-05-07T20:33:37.2120896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2120981Z def test_silu_mul_quant( 2025-05-07T20:33:37.2121052Z self, 2025-05-07T20:33:37.2121126Z T: int, 2025-05-07T20:33:37.2121196Z D: int, 2025-05-07T20:33:37.2121285Z scale_ub: Optional[float], 2025-05-07T20:33:37.2121370Z contiguous: bool, 2025-05-07T20:33:37.2121450Z compiled: bool, 2025-05-07T20:33:37.2121523Z ) -> None: 2025-05-07T20:33:37.2121610Z torch.manual_seed(2025) 2025-05-07T20:33:37.2121682Z 2025-05-07T20:33:37.2121846Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2123589Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
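The CompilationError interleaved with these OOMs (the ValueError above) is a distinct failure: this job runs on a g5.4xlarge, whose NVIDIA A10G reports compute capability (8, 6), and Triton rejects the fp8e4nv (e4m3) dtype used by _fbgemm_silu_mul_quant on that architecture, offering only fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip the fp8 path on such GPUs follows; the >= (8, 9) threshold is an assumption inferred from the error, and the test-class name is hypothetical.

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (e4m3) needs compute capability 8.9 or newer;
    # the A10G in this job reports (8, 6) and trips the Triton error above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantFp8Example(unittest.TestCase):
    pass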
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2123597Z 2025-05-07T20:33:37.2123709Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2123715Z 2025-05-07T20:33:37.2123808Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2124021Z self=, 2025-05-07T20:33:37.2124137Z T=128, 2025-05-07T20:33:37.2124210Z D=7168, 2025-05-07T20:33:37.2124286Z scale_ub=1200.0, 2025-05-07T20:33:37.2124366Z contiguous=True, 2025-05-07T20:33:37.2124444Z compiled=True, 2025-05-07T20:33:37.2124513Z ) 2025-05-07T20:33:37.2124726Z self = 2025-05-07T20:33:37.2124950Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.2124955Z 2025-05-07T20:33:37.2125034Z @given( 2025-05-07T20:33:37.2125145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2125237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2125346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2125455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2125565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2125640Z ) 2025-05-07T20:33:37.2125878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2126005Z def test_silu_mul_quant( 2025-05-07T20:33:37.2126078Z self, 2025-05-07T20:33:37.2126150Z T: int, 2025-05-07T20:33:37.2126226Z D: int, 2025-05-07T20:33:37.2126316Z scale_ub: Optional[float], 2025-05-07T20:33:37.2126396Z contiguous: bool, 2025-05-07T20:33:37.2126477Z compiled: bool, 2025-05-07T20:33:37.2126547Z ) -> None: 2025-05-07T20:33:37.2126634Z torch.manual_seed(2025) 2025-05-07T20:33:37.2126708Z 2025-05-07T20:33:37.2126867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2126938Z 2025-05-07T20:33:37.2127026Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2127144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2127274Z x = x_sign * x_clamp 2025-05-07T20:33:37.2127347Z x0 = x[:, :D] 2025-05-07T20:33:37.2127420Z x1 = x[:, D:] 2025-05-07T20:33:37.2127491Z 2025-05-07T20:33:37.2127571Z if contiguous: 2025-05-07T20:33:37.2127658Z x0 = x0.contiguous() 2025-05-07T20:33:37.2127742Z x1 = x1.contiguous() 2025-05-07T20:33:37.2127807Z 2025-05-07T20:33:37.2127890Z if scale_ub is not None: 2025-05-07T20:33:37.2127996Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2128125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2128198Z ) 2025-05-07T20:33:37.2128273Z else: 2025-05-07T20:33:37.2128361Z scale_ub_tensor = None 2025-05-07T20:33:37.2128429Z 2025-05-07T20:33:37.2128555Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2128637Z op = silu_mul_quant 2025-05-07T20:33:37.2128720Z if compiled: 2025-05-07T20:33:37.2128818Z op = torch.compile(op) 2025-05-07T20:33:37.2128918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2128986Z 2025-05-07T20:33:37.2129074Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2129078Z 2025-05-07T20:33:37.2129171Z moe/activation_test.py:117: 2025-05-07T20:33:37.2129297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2129391Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2129484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2129856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.2129944Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.2130446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2130548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2130922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2131190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2131529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2131622Z kernel = self.compile( 2025-05-07T20:33:37.2132017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2132224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2132348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2132353Z 2025-05-07T20:33:37.2132546Z self = 2025-05-07T20:33:37.2133311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2133843Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88197c7b00>} 2025-05-07T20:33:37.2134576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2134764Z context = 2025-05-07T20:33:37.2134769Z 2025-05-07T20:33:37.2134923Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2135183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2135283Z module_map=module_map) 2025-05-07T20:33:37.2135478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2135575Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2135646Z E ^ 2025-05-07T20:33:37.2135994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2136001Z 2025-05-07T20:33:37.2136412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2136417Z 2025-05-07T20:33:37.2136515Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2136734Z self=, 2025-05-07T20:33:37.2136806Z T=128, 2025-05-07T20:33:37.2136874Z D=7168, 2025-05-07T20:33:37.2136953Z scale_ub=1200.0, 2025-05-07T20:33:37.2137032Z contiguous=True, 2025-05-07T20:33:37.2137107Z compiled=False, 2025-05-07T20:33:37.2137175Z ) 2025-05-07T20:33:37.2137384Z self = 2025-05-07T20:33:37.2137552Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2137560Z 2025-05-07T20:33:37.2137633Z @given( 2025-05-07T20:33:37.2137746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2137842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2137949Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2138059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2138171Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2138240Z ) 2025-05-07T20:33:37.2138474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2138563Z def test_silu_mul_quant( 2025-05-07T20:33:37.2138633Z self, 2025-05-07T20:33:37.2138710Z T: int, 2025-05-07T20:33:37.2138782Z D: int, 2025-05-07T20:33:37.2138873Z scale_ub: Optional[float], 2025-05-07T20:33:37.2138961Z contiguous: bool, 2025-05-07T20:33:37.2139041Z compiled: bool, 2025-05-07T20:33:37.2139115Z ) -> None: 2025-05-07T20:33:37.2139254Z torch.manual_seed(2025) 2025-05-07T20:33:37.2139324Z 2025-05-07T20:33:37.2139487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2139557Z 2025-05-07T20:33:37.2139644Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2139760Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2141883Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2141894Z 2025-05-07T20:33:37.2142010Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.2142018Z 2025-05-07T20:33:37.2142206Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2142427Z self=, 2025-05-07T20:33:37.2142504Z T=128, 2025-05-07T20:33:37.2142577Z D=5120, 2025-05-07T20:33:37.2142653Z scale_ub=1200.0, 2025-05-07T20:33:37.2142736Z contiguous=True, 2025-05-07T20:33:37.2142810Z compiled=True, 2025-05-07T20:33:37.2142883Z ) 2025-05-07T20:33:37.2143096Z self = 2025-05-07T20:33:37.2143254Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.2143258Z 2025-05-07T20:33:37.2143330Z @given( 2025-05-07T20:33:37.2143443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2143603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2143712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2143823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2143932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2144003Z ) 2025-05-07T20:33:37.2144239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2144329Z def test_silu_mul_quant( 2025-05-07T20:33:37.2144403Z self, 2025-05-07T20:33:37.2144476Z T: int, 2025-05-07T20:33:37.2144547Z D: int, 2025-05-07T20:33:37.2144640Z scale_ub: Optional[float], 2025-05-07T20:33:37.2144722Z contiguous: bool, 2025-05-07T20:33:37.2144804Z compiled: bool, 2025-05-07T20:33:37.2144878Z ) -> None: 2025-05-07T20:33:37.2144965Z torch.manual_seed(2025) 2025-05-07T20:33:37.2145035Z 2025-05-07T20:33:37.2145192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2145264Z 2025-05-07T20:33:37.2145353Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2145471Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2147216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
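By this point the process is so close to exhaustion (4.44 MiB free of 22.07 GiB) that even the 20 MiB temporary for torch.clamp at moe/activation_test.py:95 fails, so every remaining example dies regardless of its parameters. One way to keep a single leaked allocation from poisoning the rest of the run would be to skip examples that cannot fit in the currently free memory; a minimal sketch using Hypothesis's assume() is below, where the helper and safety factor are hypothetical additions, not code from the test file.

import torch
from hypothesis import assume

def fits_in_free_gpu_memory(T: int, D: int, safety: float = 4.0) -> bool:
    # torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes;
    # the safety factor leaves headroom for the sign/clamp temporaries and
    # the quantized outputs. Sizing matches the 112/448 MiB requests above.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return T * 2 * D * 2 * safety <= free_bytes

# Inside test_silu_mul_quant, before the allocation:
#     assume(fits_in_free_gpu_memory(T, D))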
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2147229Z 2025-05-07T20:33:37.2147339Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.2147344Z 2025-05-07T20:33:37.2147489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2147711Z self=, 2025-05-07T20:33:37.2147783Z T=128, 2025-05-07T20:33:37.2147924Z D=7168, 2025-05-07T20:33:37.2148002Z scale_ub=None, 2025-05-07T20:33:37.2148082Z contiguous=True, 2025-05-07T20:33:37.2148160Z compiled=True, 2025-05-07T20:33:37.2148228Z ) 2025-05-07T20:33:37.2148436Z self = 2025-05-07T20:33:37.2148600Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:37.2148672Z 2025-05-07T20:33:37.2148744Z @given( 2025-05-07T20:33:37.2148853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2148948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2149056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2149168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2149273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2149345Z ) 2025-05-07T20:33:37.2149581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2149671Z def test_silu_mul_quant( 2025-05-07T20:33:37.2149744Z self, 2025-05-07T20:33:37.2149857Z T: int, 2025-05-07T20:33:37.2149925Z D: int, 2025-05-07T20:33:37.2150014Z scale_ub: Optional[float], 2025-05-07T20:33:37.2150100Z contiguous: bool, 2025-05-07T20:33:37.2150179Z compiled: bool, 2025-05-07T20:33:37.2150253Z ) -> None: 2025-05-07T20:33:37.2150351Z torch.manual_seed(2025) 2025-05-07T20:33:37.2150417Z 2025-05-07T20:33:37.2150575Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2152325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2152372Z 2025-05-07T20:33:37.2152485Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2152614Z =============================== warnings summary =============================== 2025-05-07T20:33:37.2152919Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:37.2153215Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:37.2153505Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:37.2154377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:37.2154602Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:37.2154606Z 2025-05-07T20:33:37.2154808Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:37.2154971Z ================= 1 failed, 1 deselected, 3 warnings in 12.06s ================= 2025-05-07T20:33:38.9011878Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:38.9643212Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:38.9643843Z 2025-05-07T20:33:38.9644306Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:38.9644926Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:38.9645319Z 2025-05-07T20:33:38.9645637Z 2025-05-07T20:33:38.9645642Z 2025-05-07T20:33:38.9662253Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:38.9751657Z Post job cleanup. 2025-05-07T20:33:39.0718936Z [command]/usr/bin/git version 2025-05-07T20:33:39.0762264Z git version 2.47.1 2025-05-07T20:33:39.0799567Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/cfa129f4-ffae-4973-bfbb-710246d077a2/.gitconfig' 2025-05-07T20:33:39.0810408Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/cfa129f4-ffae-4973-bfbb-710246d077a2' before making global git config changes 2025-05-07T20:33:39.0811274Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:39.0816117Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:39.0859692Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:39.0894256Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:39.1226814Z Entering 'external/asmjit' 2025-05-07T20:33:39.1294155Z Entering 'external/composable_kernel' 2025-05-07T20:33:39.1368099Z Entering 'external/cpuinfo' 2025-05-07T20:33:39.1435021Z Entering 'external/cutlass' 2025-05-07T20:33:39.1512755Z Entering 'external/googletest' 2025-05-07T20:33:39.1579823Z Entering 'external/hipify_torch' 2025-05-07T20:33:39.1646519Z Entering 'external/json' 2025-05-07T20:33:39.1732216Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:39.1757453Z http.https://github.com/.extraheader 2025-05-07T20:33:39.1769420Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:39.1800390Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:39.2131695Z Entering 'external/asmjit' 2025-05-07T20:33:39.2173615Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2216623Z Entering 'external/composable_kernel' 2025-05-07T20:33:39.2259759Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2308808Z Entering 'external/cpuinfo' 2025-05-07T20:33:39.2352296Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2395371Z Entering 'external/cutlass' 2025-05-07T20:33:39.2437781Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2489360Z 
Entering 'external/googletest' 2025-05-07T20:33:39.2532282Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2575349Z Entering 'external/hipify_torch' 2025-05-07T20:33:39.2616929Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2659389Z Entering 'external/json' 2025-05-07T20:33:39.2701386Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2855015Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:39.2886832Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:39.2897126Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:39.2897486Z ##[endgroup] 2025-05-07T20:33:39.2997648Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:50.0948651Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:06.6556237Z Cleaning up orphan processes
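The job therefore fails for two independent reasons that retries cannot fix: CUDA memory exhaustion accumulating across Hypothesis examples, and an fp8e4nv dtype the A10G cannot compile. A minimal sketch of reproducing the failing suite locally with the allocator hint applied before CUDA initialization is below; the pytest arguments are taken from the log, while driving it through pytest.main (rather than the conda run wrapper shown above) is an assumption made to keep the sketch in Python.

import os

# Apply the allocator hint from the OOM messages before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import pytest

raise SystemExit(
    pytest.main(
        ["-v", "-rsx", "-s",
         "-W", "ignore::pytest.PytestCollectionWarning",
         "./moe/activation_test.py"]
    )
)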