2025-05-07T20:22:34.9521163Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9528211Z Runner name: 'i-03e120d7c73b3b069'
2025-05-07T20:22:34.9529130Z Machine name: 'ip-10-0-57-2'
2025-05-07T20:22:34.9531951Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9534314Z Contents: read
2025-05-07T20:22:34.9534820Z Metadata: read
2025-05-07T20:22:34.9535300Z Packages: read
2025-05-07T20:22:34.9535787Z ##[endgroup]
2025-05-07T20:22:34.9538071Z Secret source: None
2025-05-07T20:22:34.9538912Z Prepare workflow directory
2025-05-07T20:22:35.0047220Z Prepare all required actions
2025-05-07T20:22:35.0083108Z Getting action download info
2025-05-07T20:22:35.2490070Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.5300029Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.9025957Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.4602453Z Getting action download info
2025-05-07T20:22:37.5928112Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.9008663Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.11, 12.6.3, 12.6.3, gcc)
2025-05-07T20:22:37.9540190Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.9655441Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.9667684Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.9668354Z ##[endgroup]
2025-05-07T20:22:39.2045839Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.2046265Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.2046501Z AMI Name: unknown
2025-05-07T20:22:39.2087177Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.6626432Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.6626737Z with:
2025-05-07T20:22:44.6626958Z   submodules: true
2025-05-07T20:22:44.6627202Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.6627575Z   token: ***
2025-05-07T20:22:44.6627775Z   ssh-strict: true
2025-05-07T20:22:44.6627972Z   ssh-user: git
2025-05-07T20:22:44.6628195Z   persist-credentials: true
2025-05-07T20:22:44.6628436Z   clean: true
2025-05-07T20:22:44.6628660Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.6628924Z   fetch-depth: 1
2025-05-07T20:22:44.6629137Z   fetch-tags: false
2025-05-07T20:22:44.6629351Z   show-progress: true
2025-05-07T20:22:44.6629563Z   lfs: false
2025-05-07T20:22:44.6629769Z   set-safe-directory: true
2025-05-07T20:22:44.6630012Z env:
2025-05-07T20:22:44.6630222Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.6630519Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.6630779Z   BUILD_TARGET: genai
2025-05-07T20:22:44.6630994Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.6631261Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.6631508Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.6631737Z ##[endgroup]
2025-05-07T20:22:44.7782279Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.7783431Z ##[group]Getting Git version info
2025-05-07T20:22:44.7783843Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7784434Z [command]/usr/bin/git version
2025-05-07T20:22:44.7784691Z git version 2.47.1
2025-05-07T20:22:44.7798896Z ##[endgroup]
2025-05-07T20:22:44.7813420Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/a7fe0bb1-bbbf-46da-842d-f986fdd76615' before making global git config changes
2025-05-07T20:22:44.7814506Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.7829386Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7868573Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7872096Z ##[group]Initializing the repository
2025-05-07T20:22:44.7876278Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7917423Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.7918370Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.7919163Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.7919781Z hint:
2025-05-07T20:22:44.7920193Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.7920738Z hint:
2025-05-07T20:22:44.7921228Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.7922161Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.7922836Z hint:
2025-05-07T20:22:44.7923209Z hint:   git branch -m <name>
2025-05-07T20:22:44.7924085Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.7932651Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.7968530Z ##[endgroup]
2025-05-07T20:22:44.7969220Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.7973167Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8004828Z ##[endgroup]
2025-05-07T20:22:44.8005451Z ##[group]Setting up auth
2025-05-07T20:22:44.8012274Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8045924Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.8411236Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.8445699Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.8796306Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.8843910Z ##[endgroup]
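Note on the "Setting up auth" step above: rather than embedding the token in the remote URL, actions/checkout injects a basic-auth header through git config, so the credential rides along on every HTTPS request. A minimal sketch of the same technique in bash ($TOKEN here is a stand-in for a real GitHub token, not something this log provides):

  # Sketch: header-based git auth over HTTPS, as in the 'Setting up auth' step above.
  # actions/checkout base64-encodes "x-access-token:<token>" as the basic-auth value.
  B64=$(printf 'x-access-token:%s' "$TOKEN" | base64 -w0)
  git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${B64}"
  git fetch origin   # now authenticated via the extra header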
2025-05-07T20:22:44.8844466Z ##[group]Fetching the repository
2025-05-07T20:22:44.8853443Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3842339Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3842873Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3867122Z ##[endgroup]
2025-05-07T20:22:45.3867532Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3870642Z ##[endgroup]
2025-05-07T20:22:45.3886919Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3926604Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.3961358Z ##[group]Checking out the ref
2025-05-07T20:22:45.3966054Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5072502Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5072969Z
2025-05-07T20:22:45.5073196Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5073705Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5074198Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5074587Z
2025-05-07T20:22:45.5074812Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5075308Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5075584Z
2025-05-07T20:22:45.5075700Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5075887Z
2025-05-07T20:22:45.5076051Z Or undo this operation with:
2025-05-07T20:22:45.5076223Z
2025-05-07T20:22:45.5076314Z   git switch -
2025-05-07T20:22:45.5076818Z
2025-05-07T20:22:45.5077042Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5077359Z
2025-05-07T20:22:45.5077743Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5085086Z ##[endgroup]
2025-05-07T20:22:45.5085503Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5091073Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5136751Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5167895Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5202282Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5229424Z ##[endgroup]
2025-05-07T20:22:45.5229805Z ##[group]Fetching submodules
2025-05-07T20:22:45.5232608Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5573813Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5905526Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5907195Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5910655Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5914302Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5917967Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5922259Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5925728Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5958264Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9178548Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.3948348Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.8152721Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.8677926Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.1275171Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.4365719Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.7584821Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.7585323Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.8083017Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.4961178Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.4961647Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.7810605Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.6742306Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.6742917Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.7822925Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.9785300Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.9785817Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.6780680Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.3658096Z From https://github.com/google/googletest
2025-05-07T20:22:54.3658705Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.4067920Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.9715367Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.9715963Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.9801100Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.6483664Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.6484265Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.7616766Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
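Each submodule above is pinned to an exact commit and fetched at depth 1, which is why the log shows "* branch <sha> -> FETCH_HEAD" rather than a full history. A sketch of the underlying idiom ($SHA is a hypothetical commit id; fetching by raw SHA works here because github.com permits it):

  # Sketch: shallow-checkout a repository at an exact commit, as
  # 'git submodule update --init --force --depth=1' does per submodule.
  git fetch --depth=1 origin "$SHA"
  git checkout --force "$SHA"   # leaves a detached HEAD at the pinned commit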
2025-05-07T20:22:55.7636477Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.7977373Z Entering 'external/asmjit'
2025-05-07T20:22:55.8009687Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8041887Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8074033Z Entering 'external/cutlass'
2025-05-07T20:22:55.8105906Z Entering 'external/googletest'
2025-05-07T20:22:55.8137662Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8169898Z Entering 'external/json'
2025-05-07T20:22:55.8215782Z ##[endgroup]
2025-05-07T20:22:55.8216218Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.8223518Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.8551715Z Entering 'external/asmjit'
2025-05-07T20:22:55.8618113Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8692011Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8757925Z Entering 'external/cutlass'
2025-05-07T20:22:55.8831260Z Entering 'external/googletest'
2025-05-07T20:22:55.8897434Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8966435Z Entering 'external/json'
2025-05-07T20:22:55.9051954Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.9380812Z Entering 'external/asmjit'
2025-05-07T20:22:55.9443655Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.9445600Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9506759Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.9510194Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9574659Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.9577695Z Entering 'external/cutlass'
2025-05-07T20:22:55.9639823Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.9643051Z Entering 'external/googletest'
2025-05-07T20:22:55.9703528Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.9706450Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9767491Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.9770437Z Entering 'external/json'
2025-05-07T20:22:55.9830264Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.9914589Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.0245964Z Entering 'external/asmjit'
2025-05-07T20:22:56.0279181Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0310534Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0343075Z Entering 'external/cutlass'
2025-05-07T20:22:56.0374883Z Entering 'external/googletest'
2025-05-07T20:22:56.0407679Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.0438558Z Entering 'external/json'
2025-05-07T20:22:56.0491867Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.0812992Z Entering 'external/asmjit'
2025-05-07T20:22:56.0846829Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0880673Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0912662Z Entering 'external/cutlass'
2025-05-07T20:22:56.0944557Z Entering 'external/googletest'
2025-05-07T20:22:56.0975960Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1010699Z Entering 'external/json'
2025-05-07T20:22:56.1070629Z ##[endgroup]
2025-05-07T20:22:56.1091106Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.1118215Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
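The url.*.insteadOf entries written above make git transparently rewrite SSH-style submodule URLs to HTTPS, so the token header configured earlier also covers submodules whose remotes use git@github.com:. A minimal sketch of the same rewrite:

  # Sketch: route SSH-style GitHub remotes over HTTPS without editing .gitmodules.
  git config --global url.https://github.com/.insteadOf git@github.com:
  # A remote recorded as git@github.com:org/repo.git is now fetched as
  # https://github.com/org/repo.git, where the header-based auth applies.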
2025-05-07T20:22:56.1293390Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.1293705Z with:
2025-05-07T20:22:56.1293936Z   name: fbgemm_genai_x86_gcc_py3.11_cu12.6.3.whl
2025-05-07T20:22:56.1294243Z   merge-multiple: false
2025-05-07T20:22:56.1294498Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.1294754Z   run-id: 14891846252
2025-05-07T20:22:56.1294960Z env:
2025-05-07T20:22:56.1295175Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.1295470Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.1295705Z   BUILD_TARGET: genai
2025-05-07T20:22:56.1295919Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.1296156Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.1296397Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.1296630Z ##[endgroup]
2025-05-07T20:22:56.3582202Z Downloading single artifact
2025-05-07T20:22:56.4571986Z Preparing to download the following artifacts:
2025-05-07T20:22:56.4572855Z - fbgemm_genai_x86_gcc_py3.11_cu12.6.3.whl (ID: 3081362046, Size: 12503661, Expected Digest: sha256:62b71de05844c49a64b362ad2b6d2df4fb5f1ee6fe564783afec567436ca2ca9)
2025-05-07T20:22:56.5138235Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-d00c8883-fd0c-5901-9007-a9cd1395759f/artifacts/83ef9f0a55c3787ac5ec90dd5a05156a974c6e4380cbb349c58c6a5843cb1014.zip
2025-05-07T20:22:56.5139913Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.5948148Z (node:56950) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.5949077Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.7748019Z SHA256 digest of downloaded artifact is 62b71de05844c49a64b362ad2b6d2df4fb5f1ee6fe564783afec567436ca2ca9
2025-05-07T20:22:56.7748646Z Artifact download completed successfully.
2025-05-07T20:22:56.7748983Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.7753995Z Download artifact has finished successfully
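The artifact step checks the downloaded zip against the digest advertised by the service (the "Expected Digest" above). The same check can be reproduced by hand; a sketch, assuming the artifact zip was saved locally as artifact.zip:

  # Sketch: verify a downloaded artifact against its expected SHA256 digest.
  EXPECTED=62b71de05844c49a64b362ad2b6d2df4fb5f1ee6fe564783afec567436ca2ca9
  ACTUAL=$(sha256sum artifact.zip | cut -d' ' -f1)
  [ "$ACTUAL" = "$EXPECTED" ] || { echo "digest mismatch: $ACTUAL" >&2; exit 1; }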
2025-05-07T20:22:56.8021170Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.8021560Z with:
2025-05-07T20:22:56.8021771Z   driver-version: 570.133.07
2025-05-07T20:22:56.8022018Z env:
2025-05-07T20:22:56.8022232Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8022533Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.8022775Z   BUILD_TARGET: genai
2025-05-07T20:22:56.8022996Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.8023234Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8023488Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8023721Z ##[endgroup]
2025-05-07T20:22:56.8115166Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.8115552Z with:
2025-05-07T20:22:56.8115940Z   timeout_minutes: 10
2025-05-07T20:22:56.8116170Z   max_attempts: 3
2025-05-07T20:22:56.8139498Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install the nvidia-docker2 package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:56.8162473Z   retry_wait_seconds: 10
2025-05-07T20:22:56.8162727Z   polling_interval_seconds: 1
2025-05-07T20:22:56.8162979Z   warning_on_retry: true
2025-05-07T20:22:56.8163220Z   continue_on_error: false
2025-05-07T20:22:56.8163540Z env:
2025-05-07T20:22:56.8163758Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8164056Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.8164293Z   BUILD_TARGET: genai
2025-05-07T20:22:56.8164510Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.8164754Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8165005Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8165236Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.8165478Z ##[endgroup]
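The central trick in the script above: a bare nvidia-smi can exit 0 even when the driver has crashed, so GPU health is probed by querying a field that turns into ERR! in that state, with exit status 14 treated as benign. Distilled into a standalone helper (a sketch, not part of the action itself):

  # Sketch: probe GPU health by querying a concrete field rather than
  # trusting nvidia-smi's overall exit code; 14 is an allowed status,
  # per the comment in the script above.
  check_gpu_health() {
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    local status=$?   # captures the exit status of the query above
    case "$status" in
      0|14) echo "GPU healthy (status ${status})" ;;
      *)    echo "GPU unhealthy (status ${status})" >&2; return "$status" ;;
    esac
  }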
2025-05-07T20:22:56.8970790Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.8971607Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.8975518Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.5353785Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.5354492Z No packages marked for removal.
2025-05-07T20:22:57.5417767Z Dependencies resolved.
2025-05-07T20:22:57.5427556Z Nothing to do.
2025-05-07T20:22:57.5427995Z Complete!
2025-05-07T20:22:57.5748833Z + install_nvidia_driver_common
2025-05-07T20:22:57.5753010Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.5753420Z + lspci
2025-05-07T20:22:57.5755115Z Before installing NVIDIA driver
2025-05-07T20:22:57.5940995Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.5943084Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.5944472Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.5945463Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.5946310Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.5947249Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.5947950Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.5948416Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.5948815Z + lsmod
2025-05-07T20:22:57.5985687Z Module                  Size  Used by
2025-05-07T20:22:57.5986096Z xt_conntrack           16384  1
2025-05-07T20:22:57.5986468Z nft_chain_nat          16384  3
2025-05-07T20:22:57.5986874Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.5987433Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.5988320Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.5989376Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.5990230Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.5990835Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.5991394Z xfrm_user              57344  1
2025-05-07T20:22:57.5991907Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.5992463Z xt_addrtype            16384  2
2025-05-07T20:22:57.5992950Z nft_compat             20480  4
2025-05-07T20:22:57.5993542Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.5994345Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.5995075Z br_netfilter           36864  0
2025-05-07T20:22:57.5995609Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.5996203Z stp                    16384  1 bridge
2025-05-07T20:22:57.5996753Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.5997295Z overlay               167936  0
2025-05-07T20:22:57.5997611Z tls                   135168  0
2025-05-07T20:22:57.5997893Z nls_ascii              16384  1
2025-05-07T20:22:57.5998135Z nls_cp437              20480  1
2025-05-07T20:22:57.5998385Z vfat                   24576  1
2025-05-07T20:22:57.5998633Z fat                    86016  1 vfat
2025-05-07T20:22:57.5998890Z sunrpc                696320  1
2025-05-07T20:22:57.5999140Z ena                   180224  0
2025-05-07T20:22:57.5999379Z i8042                  45056  0
2025-05-07T20:22:57.5999632Z serio                  28672  3 i8042
2025-05-07T20:22:57.5999892Z button                 24576  0
2025-05-07T20:22:57.6000152Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.6000430Z dm_mod                188416  0
2025-05-07T20:22:57.6000675Z sch_fq_codel           20480  17
2025-05-07T20:22:57.6000936Z fuse                  163840  1
2025-05-07T20:22:57.6001188Z loop                   36864  0
2025-05-07T20:22:57.6001431Z configfs               57344  1
2025-05-07T20:22:57.6001685Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.6001959Z dmi_sysfs              20480  0
2025-05-07T20:22:57.6002202Z crc32_pclmul           16384  0
2025-05-07T20:22:57.6002455Z crc32c_intel           24576  0
2025-05-07T20:22:57.6002708Z efivarfs               24576  1
2025-05-07T20:22:57.6002952Z + modinfo nvidia
2025-05-07T20:22:57.6003671Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.6004338Z import_ns:      DMA_BUF
2025-05-07T20:22:57.6004698Z alias:          char-major-195-*
2025-05-07T20:22:57.6005052Z version:        570.133.07
2025-05-07T20:22:57.6005392Z supported:      external
2025-05-07T20:22:57.6005792Z license:        Dual MIT/GPL
2025-05-07T20:22:57.6006228Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.6006680Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.6007219Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.6007542Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.6007904Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.6008226Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.6008534Z depends:        i2c-core,drm
2025-05-07T20:22:57.6008785Z retpoline:      Y
2025-05-07T20:22:57.6009019Z name:           nvidia
2025-05-07T20:22:57.6009516Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.6010146Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.6010622Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.6011143Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.6011451Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.6011766Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.6012128Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.6012550Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.6012958Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.6013404Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.6013787Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.6014119Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.6014408Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.6014711Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.6015095Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.6015635Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.6016131Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.6016573Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.6016986Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.6017406Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.6017815Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.6018146Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.6018505Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.6018875Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.6019214Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.6019536Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.6019860Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.6020179Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.6020488Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.6020826Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.6021199Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.6021526Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.6021851Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.6022199Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.6022532Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.6022863Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.6023196Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.6023484Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.6023804Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.6024117Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.6024430Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.6024758Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.6025104Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.6025450Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.6025778Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.6026112Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.6026449Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.6026852Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.6027099Z ++ command -v nvidia-smi
2025-05-07T20:22:57.6027351Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.6027612Z + set +e
2025-05-07T20:22:57.6027919Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.4322929Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.4324036Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.4324684Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.4325295Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.4326014Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.4327232Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.4328269Z + set -e
2025-05-07T20:22:59.4328980Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.4329483Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.4329961Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.4332184Z + sudo modprobe nvidia
2025-05-07T20:22:59.5631407Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.5631853Z + lspci
2025-05-07T20:22:59.5632120Z After installing NVIDIA driver
2025-05-07T20:22:59.5750827Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.5751446Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.5752004Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.5752524Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.5753012Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.5753547Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.5754053Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.5754538Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.5754953Z + lsmod
2025-05-07T20:22:59.5782248Z Module                  Size  Used by
2025-05-07T20:22:59.5782551Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.5782951Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.5783356Z drm                   602112  1 nvidia
2025-05-07T20:22:59.5783769Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.5784120Z backlight              24576  1 drm
2025-05-07T20:22:59.5784448Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.5784859Z xt_conntrack           16384  1
2025-05-07T20:22:59.5785217Z nft_chain_nat          16384  3
2025-05-07T20:22:59.5785577Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.5785901Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.5786279Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.5786683Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.5787130Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.5787450Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.5787755Z xfrm_user              57344  1
2025-05-07T20:22:59.5788032Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.5788321Z xt_addrtype            16384  2
2025-05-07T20:22:59.5788587Z nft_compat             20480  4
2025-05-07T20:22:59.5788904Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.5789316Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.5789703Z br_netfilter           36864  0
2025-05-07T20:22:59.5789989Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.5790287Z stp                    16384  1 bridge
2025-05-07T20:22:59.5790567Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.5790861Z overlay               167936  0
2025-05-07T20:22:59.5791120Z tls                   135168  0
2025-05-07T20:22:59.5791368Z nls_ascii              16384  1
2025-05-07T20:22:59.5791956Z nls_cp437              20480  1
2025-05-07T20:22:59.5792217Z vfat                   24576  1
2025-05-07T20:22:59.5792465Z fat                    86016  1 vfat
2025-05-07T20:22:59.5792737Z sunrpc                696320  1
2025-05-07T20:22:59.5792992Z ena                   180224  0
2025-05-07T20:22:59.5793228Z i8042                  45056  0
2025-05-07T20:22:59.5793485Z serio                  28672  3 i8042
2025-05-07T20:22:59.5793762Z button                 24576  0
2025-05-07T20:22:59.5794015Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.5794274Z dm_mod                188416  0
2025-05-07T20:22:59.5794532Z sch_fq_codel           20480  17
2025-05-07T20:22:59.5794795Z fuse                  163840  1
2025-05-07T20:22:59.5795039Z loop                   36864  0
2025-05-07T20:22:59.5795450Z configfs               57344  1
2025-05-07T20:22:59.5795707Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.5795976Z dmi_sysfs              20480  0
2025-05-07T20:22:59.5796231Z crc32_pclmul           16384  0
2025-05-07T20:22:59.5796499Z crc32c_intel           24576  0
2025-05-07T20:22:59.5796751Z efivarfs               24576  1
2025-05-07T20:22:59.5797003Z + modinfo nvidia
2025-05-07T20:22:59.5799146Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.5799783Z import_ns:      DMA_BUF
2025-05-07T20:22:59.5800108Z alias:          char-major-195-*
2025-05-07T20:22:59.5800395Z version:        570.133.07
2025-05-07T20:22:59.5800643Z supported:      external
2025-05-07T20:22:59.5800885Z license:        Dual MIT/GPL
2025-05-07T20:22:59.5801170Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.5801512Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.5801824Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.5802149Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.5802497Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.5802829Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.5803138Z depends:        i2c-core,drm
2025-05-07T20:22:59.5803393Z retpoline:      Y
2025-05-07T20:22:59.5803765Z name:           nvidia
2025-05-07T20:22:59.5804206Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.5804844Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.5805445Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.5805860Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.5806169Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.5806470Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.5806776Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.5807078Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.5807388Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.5807748Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.5808131Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.5808465Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.5808764Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.5809059Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.5809418Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.5809809Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.5810176Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.5810588Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5810992Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.5811408Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5811808Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.5812148Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.5812518Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.5813012Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.5813356Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.5813674Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.5813996Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.5814316Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.5814624Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.5814967Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.5815322Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.5815650Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.5815984Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.5816319Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.5816747Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.5817085Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.5817408Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.5817702Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.5818027Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.5818344Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.5818656Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.5818982Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.5819338Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.5819738Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.5820056Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.5820404Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.5820742Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.5821028Z + set +e
2025-05-07T20:22:59.5821214Z + nvidia-smi
2025-05-07T20:23:00.9905326Z Wed May  7 20:23:00 2025
2025-05-07T20:23:00.9906026Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9906972Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:00.9907852Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9908717Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.9909260Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.9909692Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.9910029Z |=========================================+========================+======================|
2025-05-07T20:23:00.9972264Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.9972724Z |  0%   30C    P0             59W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.9973108Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.9973503Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9973898Z
2025-05-07T20:23:00.9974478Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9974938Z | Processes:                                                                              |
2025-05-07T20:23:00.9975383Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:00.9975796Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.9976138Z |=========================================================================================|
2025-05-07T20:23:00.9977086Z |  No running processes found                                                             |
2025-05-07T20:23:00.9977990Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.4111395Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.8220183Z NVIDIA A10G
2025-05-07T20:23:03.0894089Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.0894760Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.0895070Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.0895370Z + set -e
2025-05-07T20:23:03.0895578Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.0903398Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.0906872Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.5437567Z Last metadata expiration check: 0:05:44 ago on Wed May  7 20:17:19 2025.
2025-05-07T20:23:03.5690792Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.6084637Z Dependencies resolved.
2025-05-07T20:23:03.6266884Z Nothing to do.
2025-05-07T20:23:03.6267204Z Complete!
2025-05-07T20:23:03.6657709Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.6658552Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.6659734Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.9860581Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.0428828Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.5174761Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:04.5426696Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.5827270Z Dependencies resolved.
2025-05-07T20:23:04.6005771Z ================================================================================
2025-05-07T20:23:04.6006193Z  Package                        Arch    Version   Repository               Size
2025-05-07T20:23:04.6006600Z ================================================================================
2025-05-07T20:23:04.6006897Z Downgrading:
2025-05-07T20:23:04.6007262Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.6007848Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.6008197Z
2025-05-07T20:23:04.6008290Z Transaction Summary
2025-05-07T20:23:04.6008535Z ================================================================================
2025-05-07T20:23:04.6008849Z Downgrade  2 Packages
2025-05-07T20:23:04.6008997Z
2025-05-07T20:23:04.6009107Z Total download size: 6.8 M
2025-05-07T20:23:04.6010013Z Downloading Packages:
2025-05-07T20:23:04.6651691Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  20 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.6836632Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  69 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.6845888Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.6848762Z Total                                            82 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.6851492Z Running transaction check
2025-05-07T20:23:04.6954484Z Transaction check succeeded.
2025-05-07T20:23:04.6955139Z Running transaction test
2025-05-07T20:23:04.7251537Z Transaction test succeeded.
2025-05-07T20:23:04.7254109Z Running transaction
2025-05-07T20:23:05.2719285Z   Preparing        :                                                        1/1
2025-05-07T20:23:05.3775721Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:05.3812409Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.4038705Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.4039299Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.4142013Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.4167685Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:06.8024860Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:06.8025469Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:06.8026003Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:06.8026533Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:06.9471506Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:06.9471506Z ================================================================================
2025-05-07T20:23:06.9472382Z WARNING:
2025-05-07T20:23:06.9472633Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.9472866Z
2025-05-07T20:23:06.9472964Z   Available Versions:
2025-05-07T20:23:06.9473113Z
2025-05-07T20:23:06.9473216Z   Version 2023.7.20250331:
2025-05-07T20:23:06.9473528Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.9473786Z
2025-05-07T20:23:06.9473906Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.9474114Z
2025-05-07T20:23:06.9474206Z     Release notes:
2025-05-07T20:23:06.9474607Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.9474981Z
2025-05-07T20:23:06.9475071Z   Version 2023.7.20250414:
2025-05-07T20:23:06.9475377Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.9475622Z
2025-05-07T20:23:06.9475743Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.9475949Z
2025-05-07T20:23:06.9476038Z     Release notes:
2025-05-07T20:23:06.9476441Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.9476803Z
2025-05-07T20:23:06.9476899Z   Version 2023.7.20250428:
2025-05-07T20:23:06.9477302Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.9477577Z
2025-05-07T20:23:06.9477950Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.9478218Z
2025-05-07T20:23:06.9478371Z     Release notes:
2025-05-07T20:23:06.9478819Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.9479289Z
2025-05-07T20:23:06.9479472Z ================================================================================
2025-05-07T20:23:06.9828147Z
2025-05-07T20:23:06.9828305Z
2025-05-07T20:23:06.9842142Z Downgraded:
2025-05-07T20:23:06.9842635Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:06.9843221Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:06.9843727Z
2025-05-07T20:23:06.9843825Z Complete!
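Note that yum planned this as a downgrade on its own: the AMI ships nvidia-container-toolkit 1.17.6 while the action pins 1.16.2, and on Amazon Linux 2023 dnf resolves an install request for an explicit older version into a downgrade transaction, as seen above. A sketch of the same pinning, with an optional lock (the versionlock plugin is an assumption, not something this job installs):

  # Sketch: pin nvidia-container-toolkit to an exact version; dnf resolves
  # the request as a downgrade when a newer build is already installed.
  sudo dnf install -y nvidia-container-toolkit-1.16.2
  # Optionally freeze it so a later 'dnf upgrade' does not undo the pin
  # (requires the dnf versionlock plugin):
  # sudo dnf versionlock add nvidia-container-toolkit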
2025-05-07T20:23:07.0307921Z + sudo systemctl restart docker
2025-05-07T20:23:10.9842219Z Wed May  7 20:23:10 2025
2025-05-07T20:23:10.9843002Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.9844196Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:10.9845155Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.9846133Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:10.9847168Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:10.9848017Z |                                         |                        |               MIG M. |
2025-05-07T20:23:10.9848674Z |=========================================+========================+======================|
2025-05-07T20:23:10.9923719Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:10.9924662Z |  0%   30C    P0             59W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:10.9925076Z |                                         |                        |                  N/A |
2025-05-07T20:23:10.9925473Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.9925864Z
2025-05-07T20:23:10.9926245Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.9926671Z | Processes:                                                                              |
2025-05-07T20:23:10.9927114Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:10.9927690Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:10.9928039Z |=========================================================================================|
2025-05-07T20:23:10.9928491Z |  No running processes found                                                             |
2025-05-07T20:23:10.9928960Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.8749733Z Command completed after 1 attempt(s).
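With the toolkit installed and persistence mode now showing On, the step's final export (GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all) is what later steps splice into docker run. A sketch of how such a step would consume it (the image tag is a placeholder):

  # Sketch: consume the exported GPU_FLAG in a later job step.
  # ${GPU_FLAG} is deliberately unquoted so it splits into separate args.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi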
2025-05-07T20:23:11.8835922Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8836410Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8852435Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.8852787Z env:
2025-05-07T20:23:11.8853017Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.8853316Z   BUILD_ENV: build_binary
2025-05-07T20:23:11.8853566Z   BUILD_TARGET: genai
2025-05-07T20:23:11.8853808Z   BUILD_VARIANT: cuda
2025-05-07T20:23:11.8854040Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:11.8854297Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.8854600Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.8854922Z ##[endgroup]
2025-05-07T20:23:12.2233432Z ################################################################################
2025-05-07T20:23:12.2233795Z # Print System Info
2025-05-07T20:23:12.2234013Z #
2025-05-07T20:23:12.2249126Z # [2025-05-07T20:23:12.224Z] + print_system_info
2025-05-07T20:23:12.2249491Z ################################################################################
2025-05-07T20:23:12.2249712Z
2025-05-07T20:23:12.2249827Z ################################################################################
2025-05-07T20:23:12.2250163Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.2250464Z + printenv
2025-05-07T20:23:12.2250581Z
2025-05-07T20:23:12.2275002Z SHELL=/bin/bash
2025-05-07T20:23:12.2275403Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.2275971Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.2276690Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2277471Z GITHUB_ACTION=__run
2025-05-07T20:23:12.2277874Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.2278338Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.2278662Z RUNNER_NAME=i-03e120d7c73b3b069
2025-05-07T20:23:12.2278956Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.2279263Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.2279521Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.2279892Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.2280322Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.2280601Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.2280920Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.2281426Z ***
2025-05-07T20:23:12.2281632Z LOGNAME=ec2-user
2025-05-07T20:23:12.2281864Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.2282127Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.2282371Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.2282602Z SYSTEMD_EXEC_PID=55511
2025-05-07T20:23:12.2282884Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.2283538Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.2284054Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.2284341Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.2284600Z RUNNER_OS=Linux
2025-05-07T20:23:12.2284827Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.2285072Z HOME=/home/ec2-user
2025-05-07T20:23:12.2285328Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.2285626Z LANG=C.UTF-8
2025-05-07T20:23:12.2285915Z RUNNER_TRACKING_ID=github_04a57729-97cf-41ac-88c5-5ac90b307b9a
2025-05-07T20:23:12.2286269Z RUNNER_ARCH=X64
2025-05-07T20:23:12.2286554Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.2287253Z BUILD_TARGET=genai
2025-05-07T20:23:12.2287781Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2288642Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2289374Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.2290034Z INVOCATION_ID=4dac1ab9286f4f74ada387b6af3aba5a
2025-05-07T20:23:12.2290367Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.2290635Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.2291203Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2291814Z BUILD_ENV=build_binary
2025-05-07T20:23:12.2292045Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.2292281Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.2292536Z KERN_NAME_LC=linux
2025-05-07T20:23:12.2292768Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.2293072Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.2293401Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.2293671Z USER=ec2-user
2025-05-07T20:23:12.2293992Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.2294371Z SHLVL=1 2025-05-07T20:23:12.2294640Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.2295060Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.2295542Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.2295902Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.2296144Z KERN_NAME=Linux 2025-05-07T20:23:12.2296371Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.2296829Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.2297403Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.2297679Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.2297921Z JOURNAL_STREAM=8:93485 2025-05-07T20:23:12.2298244Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.2298605Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.2298912Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.2299248Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.2299470Z CI=true 2025-05-07T20:23:12.2299673Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.2299959Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.2300239Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.2300481Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.2301089Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_3a5eed80-7251-498b-a987-a21c05c070ae 2025-05-07T20:23:12.2301678Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.2301892Z _=/usr/bin/printenv 2025-05-07T20:23:12.2302034Z 2025-05-07T20:23:12.2302153Z ################################################################################ 2025-05-07T20:23:12.2302477Z [INFO] Print ldd version ... 2025-05-07T20:23:12.2302742Z + ldd --version 2025-05-07T20:23:12.2302876Z 2025-05-07T20:23:12.2302972Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.2303246Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.2303693Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.2304221Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.2304678Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.2304903Z 2025-05-07T20:23:12.2305018Z ################################################################################ 2025-05-07T20:23:12.2305333Z [INFO] Print CPU info ... 
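The `GITHUB_ENV`, `GITHUB_OUTPUT`, and `GITHUB_PATH` entries in the environment dump above are GitHub Actions "file commands": appending lines to these files is how a step passes environment variables, step outputs, and PATH entries to later steps. A generic usage sketch; the variable and output names here are illustrative, not taken from this workflow:

```bash
# Standard GitHub Actions file-command usage; names are hypothetical.
echo "BUILD_REF=${GITHUB_SHA}" >> "$GITHUB_ENV"          # env var for later steps
echo "artifact_name=fbgemm_gpu.whl" >> "$GITHUB_OUTPUT"  # step output
echo "$HOME/miniconda/bin" >> "$GITHUB_PATH"             # prepended to PATH
```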
2025-05-07T20:23:12.2305571Z + nproc 2025-05-07T20:23:12.2305688Z 2025-05-07T20:23:12.2318476Z 16 2025-05-07T20:23:12.2320095Z 2025-05-07T20:23:12.2320410Z + lscpu 2025-05-07T20:23:12.2320542Z 2025-05-07T20:23:12.2432671Z Architecture: x86_64 2025-05-07T20:23:12.2433186Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.2433929Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2434323Z Byte Order: Little Endian 2025-05-07T20:23:12.2434638Z CPU(s): 16 2025-05-07T20:23:12.2434923Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.2435238Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.2435578Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.2435887Z CPU family: 23 2025-05-07T20:23:12.2436322Z Model: 49 2025-05-07T20:23:12.2436613Z Thread(s) per core: 2 2025-05-07T20:23:12.2436894Z Core(s) per socket: 8 2025-05-07T20:23:12.2437177Z Socket(s): 1 2025-05-07T20:23:12.2437452Z Stepping: 0 2025-05-07T20:23:12.2437749Z BogoMIPS: 5599.99 2025-05-07T20:23:12.2440094Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2442198Z Hypervisor vendor: KVM 2025-05-07T20:23:12.2442506Z Virtualization type: full 2025-05-07T20:23:12.2442845Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.2443215Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.2443693Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.2444044Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.2444370Z NUMA node(s): 1 2025-05-07T20:23:12.2444660Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.2444994Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.2445366Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.2445722Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.2446065Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.2446428Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.2446787Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.2447149Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.2447691Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.2448447Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.2449210Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.2450168Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.2451191Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.2451869Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.2452233Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.2452550Z 2025-05-07T20:23:12.2452640Z + cat /proc/cpuinfo 2025-05-07T20:23:12.2452781Z 2025-05-07T20:23:12.2452865Z processor : 0 2025-05-07T20:23:12.2453082Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2453316Z cpu family : 23 2025-05-07T20:23:12.2453530Z model : 49 
2025-05-07T20:23:12.2453738Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2453974Z stepping : 0 2025-05-07T20:23:12.2454183Z microcode : 0x830107f 2025-05-07T20:23:12.2454574Z cpu MHz : 2359.481 2025-05-07T20:23:12.2454784Z cache size : 512 KB 2025-05-07T20:23:12.2454999Z physical id : 0 2025-05-07T20:23:12.2455213Z siblings : 16 2025-05-07T20:23:12.2455408Z core id : 0 2025-05-07T20:23:12.2455609Z cpu cores : 8 2025-05-07T20:23:12.2455808Z apicid : 0 2025-05-07T20:23:12.2456003Z initial apicid : 0 2025-05-07T20:23:12.2456218Z fpu : yes 2025-05-07T20:23:12.2456420Z fpu_exception : yes 2025-05-07T20:23:12.2456630Z cpuid level : 13 2025-05-07T20:23:12.2456840Z wp : yes 2025-05-07T20:23:12.2458939Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2461188Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2461675Z bogomips : 5599.99 2025-05-07T20:23:12.2461893Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2462133Z clflush size : 64 2025-05-07T20:23:12.2462358Z cache_alignment : 64 2025-05-07T20:23:12.2462622Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2462943Z power management: 2025-05-07T20:23:12.2463075Z 2025-05-07T20:23:12.2463166Z processor : 1 2025-05-07T20:23:12.2463375Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2463619Z cpu family : 23 2025-05-07T20:23:12.2463829Z model : 49 2025-05-07T20:23:12.2464030Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2464278Z stepping : 0 2025-05-07T20:23:12.2464485Z microcode : 0x830107f 2025-05-07T20:23:12.2464710Z cpu MHz : 3286.602 2025-05-07T20:23:12.2464921Z cache size : 512 KB 2025-05-07T20:23:12.2465144Z physical id : 0 2025-05-07T20:23:12.2465355Z siblings : 16 2025-05-07T20:23:12.2465553Z core id : 1 2025-05-07T20:23:12.2465757Z cpu cores : 8 2025-05-07T20:23:12.2465960Z apicid : 2 2025-05-07T20:23:12.2466157Z initial apicid : 2 2025-05-07T20:23:12.2466370Z fpu : yes 2025-05-07T20:23:12.2466573Z fpu_exception : yes 2025-05-07T20:23:12.2466786Z cpuid level : 13 2025-05-07T20:23:12.2466996Z wp : yes 2025-05-07T20:23:12.2468951Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2471173Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2471659Z bogomips : 5599.99 2025-05-07T20:23:12.2471880Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2472118Z clflush size : 64 
2025-05-07T20:23:12.2472331Z cache_alignment : 64 2025-05-07T20:23:12.2472601Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2472919Z power management: 2025-05-07T20:23:12.2473051Z 2025-05-07T20:23:12.2473144Z processor : 2 2025-05-07T20:23:12.2473353Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2473591Z cpu family : 23 2025-05-07T20:23:12.2473796Z model : 49 2025-05-07T20:23:12.2474000Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2474243Z stepping : 0 2025-05-07T20:23:12.2474455Z microcode : 0x830107f 2025-05-07T20:23:12.2474673Z cpu MHz : 2775.142 2025-05-07T20:23:12.2474894Z cache size : 512 KB 2025-05-07T20:23:12.2475110Z physical id : 0 2025-05-07T20:23:12.2475317Z siblings : 16 2025-05-07T20:23:12.2475645Z core id : 2 2025-05-07T20:23:12.2475847Z cpu cores : 8 2025-05-07T20:23:12.2476048Z apicid : 4 2025-05-07T20:23:12.2476242Z initial apicid : 4 2025-05-07T20:23:12.2476459Z fpu : yes 2025-05-07T20:23:12.2476662Z fpu_exception : yes 2025-05-07T20:23:12.2476874Z cpuid level : 13 2025-05-07T20:23:12.2477084Z wp : yes 2025-05-07T20:23:12.2479116Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2481338Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2481832Z bogomips : 5599.99 2025-05-07T20:23:12.2482050Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2482288Z clflush size : 64 2025-05-07T20:23:12.2482514Z cache_alignment : 64 2025-05-07T20:23:12.2482779Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2483093Z power management: 2025-05-07T20:23:12.2483224Z 2025-05-07T20:23:12.2483314Z processor : 3 2025-05-07T20:23:12.2483679Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2483928Z cpu family : 23 2025-05-07T20:23:12.2484134Z model : 49 2025-05-07T20:23:12.2484335Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2484573Z stepping : 0 2025-05-07T20:23:12.2484779Z microcode : 0x830107f 2025-05-07T20:23:12.2485019Z cpu MHz : 3298.846 2025-05-07T20:23:12.2485225Z cache size : 512 KB 2025-05-07T20:23:12.2485441Z physical id : 0 2025-05-07T20:23:12.2485649Z siblings : 16 2025-05-07T20:23:12.2485844Z core id : 3 2025-05-07T20:23:12.2486045Z cpu cores : 8 2025-05-07T20:23:12.2486257Z apicid : 6 2025-05-07T20:23:12.2486450Z initial apicid : 6 2025-05-07T20:23:12.2486663Z fpu : yes 2025-05-07T20:23:12.2486867Z fpu_exception : yes 2025-05-07T20:23:12.2487078Z cpuid level : 13 2025-05-07T20:23:12.2487290Z wp : yes 2025-05-07T20:23:12.2489249Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2491471Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2491960Z bogomips : 5599.99 2025-05-07T20:23:12.2492198Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2492461Z clflush size : 64 2025-05-07T20:23:12.2492680Z cache_alignment : 64 2025-05-07T20:23:12.2492944Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2493256Z power management: 2025-05-07T20:23:12.2493384Z 2025-05-07T20:23:12.2493517Z processor : 4 2025-05-07T20:23:12.2507975Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2508338Z cpu family : 23 2025-05-07T20:23:12.2508642Z model : 49 2025-05-07T20:23:12.2508876Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2509121Z stepping : 0 2025-05-07T20:23:12.2509336Z microcode : 0x830107f 2025-05-07T20:23:12.2509573Z cpu MHz : 3302.521 2025-05-07T20:23:12.2509787Z cache size : 512 KB 2025-05-07T20:23:12.2510006Z physical id : 0 2025-05-07T20:23:12.2510217Z siblings : 16 2025-05-07T20:23:12.2510412Z core id : 4 2025-05-07T20:23:12.2510615Z cpu cores : 8 2025-05-07T20:23:12.2510823Z apicid : 8 2025-05-07T20:23:12.2511147Z initial apicid : 8 2025-05-07T20:23:12.2511366Z fpu : yes 2025-05-07T20:23:12.2511630Z fpu_exception : yes 2025-05-07T20:23:12.2511846Z cpuid level : 13 2025-05-07T20:23:12.2512060Z wp : yes 2025-05-07T20:23:12.2514111Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2516346Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2516837Z bogomips : 5599.99 2025-05-07T20:23:12.2517055Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2517299Z clflush size : 64 2025-05-07T20:23:12.2517523Z cache_alignment : 64 2025-05-07T20:23:12.2517791Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2518113Z power management: 2025-05-07T20:23:12.2518247Z 2025-05-07T20:23:12.2518344Z processor : 5 2025-05-07T20:23:12.2518558Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2518798Z cpu family : 23 2025-05-07T20:23:12.2519014Z model : 49 2025-05-07T20:23:12.2519212Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2519469Z stepping : 0 2025-05-07T20:23:12.2519685Z microcode : 0x830107f 2025-05-07T20:23:12.2519911Z cpu MHz : 3295.810 2025-05-07T20:23:12.2520125Z cache size : 512 KB 2025-05-07T20:23:12.2520346Z physical id : 0 2025-05-07T20:23:12.2520552Z siblings : 16 2025-05-07T20:23:12.2520757Z core id : 5 2025-05-07T20:23:12.2520959Z cpu cores : 8 2025-05-07T20:23:12.2521157Z apicid : 10 2025-05-07T20:23:12.2521368Z initial apicid : 10 2025-05-07T20:23:12.2521585Z fpu : yes 2025-05-07T20:23:12.2521786Z fpu_exception : yes 2025-05-07T20:23:12.2522005Z cpuid level : 13 2025-05-07T20:23:12.2522215Z wp : yes 2025-05-07T20:23:12.2524317Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2526549Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2527034Z bogomips : 5599.99 2025-05-07T20:23:12.2527261Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2527508Z clflush size : 64 2025-05-07T20:23:12.2527730Z cache_alignment : 64 2025-05-07T20:23:12.2528006Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2528326Z power management: 2025-05-07T20:23:12.2528461Z 2025-05-07T20:23:12.2528547Z processor : 6 2025-05-07T20:23:12.2528774Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2529022Z cpu family : 23 2025-05-07T20:23:12.2529232Z model : 49 2025-05-07T20:23:12.2529449Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2529695Z stepping : 0 2025-05-07T20:23:12.2529899Z microcode : 0x830107f 2025-05-07T20:23:12.2530135Z cpu MHz : 3308.553 2025-05-07T20:23:12.2530354Z cache size : 512 KB 2025-05-07T20:23:12.2530572Z physical id : 0 2025-05-07T20:23:12.2530778Z siblings : 16 2025-05-07T20:23:12.2530980Z core id : 6 2025-05-07T20:23:12.2531185Z cpu cores : 8 2025-05-07T20:23:12.2531390Z apicid : 12 2025-05-07T20:23:12.2531603Z initial apicid : 12 2025-05-07T20:23:12.2531817Z fpu : yes 2025-05-07T20:23:12.2532011Z fpu_exception : yes 2025-05-07T20:23:12.2532233Z cpuid level : 13 2025-05-07T20:23:12.2532566Z wp : yes 2025-05-07T20:23:12.2534597Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2536859Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2537356Z bogomips : 5599.99 2025-05-07T20:23:12.2537581Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2537812Z clflush size : 64 2025-05-07T20:23:12.2538034Z cache_alignment : 64 2025-05-07T20:23:12.2538306Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2538842Z power management: 2025-05-07T20:23:12.2538981Z 2025-05-07T20:23:12.2539072Z processor : 7 2025-05-07T20:23:12.2539293Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2539528Z cpu family : 23 2025-05-07T20:23:12.2539730Z model : 49 2025-05-07T20:23:12.2539937Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2540182Z stepping : 0 2025-05-07T20:23:12.2540385Z microcode : 0x830107f 2025-05-07T20:23:12.2540613Z cpu MHz : 3299.922 2025-05-07T20:23:12.2540836Z cache size : 512 KB 2025-05-07T20:23:12.2541046Z physical id : 0 2025-05-07T20:23:12.2541256Z siblings : 16 2025-05-07T20:23:12.2541458Z core id : 7 2025-05-07T20:23:12.2541655Z cpu cores : 8 2025-05-07T20:23:12.2541864Z apicid : 
14 2025-05-07T20:23:12.2542074Z initial apicid : 14 2025-05-07T20:23:12.2542282Z fpu : yes 2025-05-07T20:23:12.2542486Z fpu_exception : yes 2025-05-07T20:23:12.2542702Z cpuid level : 13 2025-05-07T20:23:12.2542904Z wp : yes 2025-05-07T20:23:12.2544856Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2547081Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2547569Z bogomips : 5599.99 2025-05-07T20:23:12.2547793Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2548024Z clflush size : 64 2025-05-07T20:23:12.2548242Z cache_alignment : 64 2025-05-07T20:23:12.2548504Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2548819Z power management: 2025-05-07T20:23:12.2548946Z 2025-05-07T20:23:12.2549025Z processor : 8 2025-05-07T20:23:12.2549228Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2549454Z cpu family : 23 2025-05-07T20:23:12.2549648Z model : 49 2025-05-07T20:23:12.2549841Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2550070Z stepping : 0 2025-05-07T20:23:12.2550266Z microcode : 0x830107f 2025-05-07T20:23:12.2550479Z cpu MHz : 1994.443 2025-05-07T20:23:12.2550684Z cache size : 512 KB 2025-05-07T20:23:12.2550887Z physical id : 0 2025-05-07T20:23:12.2551088Z siblings : 16 2025-05-07T20:23:12.2551290Z core id : 0 2025-05-07T20:23:12.2551485Z cpu cores : 8 2025-05-07T20:23:12.2551673Z apicid : 1 2025-05-07T20:23:12.2551857Z initial apicid : 1 2025-05-07T20:23:12.2552068Z fpu : yes 2025-05-07T20:23:12.2552259Z fpu_exception : yes 2025-05-07T20:23:12.2552465Z cpuid level : 13 2025-05-07T20:23:12.2552660Z wp : yes 2025-05-07T20:23:12.2554593Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2557080Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2557572Z bogomips : 5599.99 2025-05-07T20:23:12.2557795Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2558025Z clflush size : 64 2025-05-07T20:23:12.2558241Z cache_alignment : 64 2025-05-07T20:23:12.2558509Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2558828Z power management: 2025-05-07T20:23:12.2558959Z 2025-05-07T20:23:12.2559043Z processor : 9 2025-05-07T20:23:12.2559263Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2559504Z cpu family : 23 2025-05-07T20:23:12.2559707Z model : 49 2025-05-07T20:23:12.2559914Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2560155Z 
stepping : 0 2025-05-07T20:23:12.2560355Z microcode : 0x830107f 2025-05-07T20:23:12.2560583Z cpu MHz : 3280.714 2025-05-07T20:23:12.2560801Z cache size : 512 KB 2025-05-07T20:23:12.2561010Z physical id : 0 2025-05-07T20:23:12.2561222Z siblings : 16 2025-05-07T20:23:12.2561416Z core id : 1 2025-05-07T20:23:12.2561617Z cpu cores : 8 2025-05-07T20:23:12.2561818Z apicid : 3 2025-05-07T20:23:12.2562016Z initial apicid : 3 2025-05-07T20:23:12.2562221Z fpu : yes 2025-05-07T20:23:12.2562422Z fpu_exception : yes 2025-05-07T20:23:12.2562645Z cpuid level : 13 2025-05-07T20:23:12.2562846Z wp : yes 2025-05-07T20:23:12.2564902Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2567120Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2567605Z bogomips : 5599.99 2025-05-07T20:23:12.2567830Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2568059Z clflush size : 64 2025-05-07T20:23:12.2568276Z cache_alignment : 64 2025-05-07T20:23:12.2568546Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2568854Z power management: 2025-05-07T20:23:12.2568991Z 2025-05-07T20:23:12.2569071Z processor : 10 2025-05-07T20:23:12.2569284Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2569518Z cpu family : 23 2025-05-07T20:23:12.2569722Z model : 49 2025-05-07T20:23:12.2569928Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2570161Z stepping : 0 2025-05-07T20:23:12.2570366Z microcode : 0x830107f 2025-05-07T20:23:12.2570590Z cpu MHz : 3260.265 2025-05-07T20:23:12.2570796Z cache size : 512 KB 2025-05-07T20:23:12.2571009Z physical id : 0 2025-05-07T20:23:12.2571210Z siblings : 16 2025-05-07T20:23:12.2571408Z core id : 2 2025-05-07T20:23:12.2571606Z cpu cores : 8 2025-05-07T20:23:12.2571803Z apicid : 5 2025-05-07T20:23:12.2571997Z initial apicid : 5 2025-05-07T20:23:12.2572221Z fpu : yes 2025-05-07T20:23:12.2572456Z fpu_exception : yes 2025-05-07T20:23:12.2572681Z cpuid level : 13 2025-05-07T20:23:12.2572881Z wp : yes 2025-05-07T20:23:12.2574983Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2577297Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2577784Z bogomips : 5599.99 2025-05-07T20:23:12.2578082Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2578321Z clflush size : 64 2025-05-07T20:23:12.2578537Z cache_alignment : 64 2025-05-07T20:23:12.2578797Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.2579118Z power management: 2025-05-07T20:23:12.2579247Z 2025-05-07T20:23:12.2579337Z processor : 11 2025-05-07T20:23:12.2579548Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2579779Z cpu family : 23 2025-05-07T20:23:12.2579982Z model : 49 2025-05-07T20:23:12.2580189Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2580422Z stepping : 0 2025-05-07T20:23:12.2580627Z microcode : 0x830107f 2025-05-07T20:23:12.2580853Z cpu MHz : 3292.356 2025-05-07T20:23:12.2581057Z cache size : 512 KB 2025-05-07T20:23:12.2581268Z physical id : 0 2025-05-07T20:23:12.2581473Z siblings : 16 2025-05-07T20:23:12.2581665Z core id : 3 2025-05-07T20:23:12.2581863Z cpu cores : 8 2025-05-07T20:23:12.2582058Z apicid : 7 2025-05-07T20:23:12.2582248Z initial apicid : 7 2025-05-07T20:23:12.2582459Z fpu : yes 2025-05-07T20:23:12.2582654Z fpu_exception : yes 2025-05-07T20:23:12.2582863Z cpuid level : 13 2025-05-07T20:23:12.2583069Z wp : yes 2025-05-07T20:23:12.2585003Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2587213Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2587687Z bogomips : 5599.99 2025-05-07T20:23:12.2587903Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2588136Z clflush size : 64 2025-05-07T20:23:12.2588346Z cache_alignment : 64 2025-05-07T20:23:12.2588606Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2588918Z power management: 2025-05-07T20:23:12.2589048Z 2025-05-07T20:23:12.2589137Z processor : 12 2025-05-07T20:23:12.2589341Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2589572Z cpu family : 23 2025-05-07T20:23:12.2589771Z model : 49 2025-05-07T20:23:12.2589966Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2590209Z stepping : 0 2025-05-07T20:23:12.2590410Z microcode : 0x830107f 2025-05-07T20:23:12.2590628Z cpu MHz : 3302.033 2025-05-07T20:23:12.2590838Z cache size : 512 KB 2025-05-07T20:23:12.2591047Z physical id : 0 2025-05-07T20:23:12.2591246Z siblings : 16 2025-05-07T20:23:12.2591443Z core id : 4 2025-05-07T20:23:12.2591637Z cpu cores : 8 2025-05-07T20:23:12.2591828Z apicid : 9 2025-05-07T20:23:12.2592025Z initial apicid : 9 2025-05-07T20:23:12.2592232Z fpu : yes 2025-05-07T20:23:12.2592420Z fpu_exception : yes 2025-05-07T20:23:12.2592637Z cpuid level : 13 2025-05-07T20:23:12.2592840Z wp : yes 2025-05-07T20:23:12.2594775Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.2597070Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2597547Z bogomips : 5599.99 2025-05-07T20:23:12.2597764Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2597998Z clflush size : 64 2025-05-07T20:23:12.2598208Z cache_alignment : 64 2025-05-07T20:23:12.2598558Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2598875Z power management: 2025-05-07T20:23:12.2599004Z 2025-05-07T20:23:12.2599090Z processor : 13 2025-05-07T20:23:12.2599305Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2599544Z cpu family : 23 2025-05-07T20:23:12.2599742Z model : 49 2025-05-07T20:23:12.2599944Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2600180Z stepping : 0 2025-05-07T20:23:12.2600382Z microcode : 0x830107f 2025-05-07T20:23:12.2600607Z cpu MHz : 3297.445 2025-05-07T20:23:12.2600822Z cache size : 512 KB 2025-05-07T20:23:12.2601032Z physical id : 0 2025-05-07T20:23:12.2601240Z siblings : 16 2025-05-07T20:23:12.2601445Z core id : 5 2025-05-07T20:23:12.2601634Z cpu cores : 8 2025-05-07T20:23:12.2601833Z apicid : 11 2025-05-07T20:23:12.2602037Z initial apicid : 11 2025-05-07T20:23:12.2602269Z fpu : yes 2025-05-07T20:23:12.2602481Z fpu_exception : yes 2025-05-07T20:23:12.2602692Z cpuid level : 13 2025-05-07T20:23:12.2602894Z wp : yes 2025-05-07T20:23:12.2604946Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2607154Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2607630Z bogomips : 5599.99 2025-05-07T20:23:12.2607845Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2608069Z clflush size : 64 2025-05-07T20:23:12.2608287Z cache_alignment : 64 2025-05-07T20:23:12.2608552Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2608861Z power management: 2025-05-07T20:23:12.2608997Z 2025-05-07T20:23:12.2609078Z processor : 14 2025-05-07T20:23:12.2609288Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2609522Z cpu family : 23 2025-05-07T20:23:12.2609717Z model : 49 2025-05-07T20:23:12.2609916Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2610153Z stepping : 0 2025-05-07T20:23:12.2610347Z microcode : 0x830107f 2025-05-07T20:23:12.2610568Z cpu MHz : 3300.912 2025-05-07T20:23:12.2610777Z cache size : 512 KB 2025-05-07T20:23:12.2610985Z physical id : 0 2025-05-07T20:23:12.2611191Z siblings : 16 2025-05-07T20:23:12.2611388Z core id : 6 2025-05-07T20:23:12.2611577Z cpu cores : 8 2025-05-07T20:23:12.2611771Z apicid : 13 2025-05-07T20:23:12.2611971Z initial apicid : 13 2025-05-07T20:23:12.2612175Z fpu : yes 2025-05-07T20:23:12.2612370Z fpu_exception : yes 2025-05-07T20:23:12.2612583Z cpuid level : 13 2025-05-07T20:23:12.2612779Z wp : yes 2025-05-07T20:23:12.2614715Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2618952Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2619435Z bogomips : 5599.99 2025-05-07T20:23:12.2619645Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2619880Z clflush size : 64 2025-05-07T20:23:12.2620094Z cache_alignment : 64 2025-05-07T20:23:12.2620364Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2620672Z power management: 2025-05-07T20:23:12.2620806Z 2025-05-07T20:23:12.2620977Z processor : 15 2025-05-07T20:23:12.2621198Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2621429Z cpu family : 23 2025-05-07T20:23:12.2621637Z model : 49 2025-05-07T20:23:12.2621842Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2622077Z stepping : 0 2025-05-07T20:23:12.2622285Z microcode : 0x830107f 2025-05-07T20:23:12.2622512Z cpu MHz : 3294.458 2025-05-07T20:23:12.2622718Z cache size : 512 KB 2025-05-07T20:23:12.2622930Z physical id : 0 2025-05-07T20:23:12.2623142Z siblings : 16 2025-05-07T20:23:12.2623338Z core id : 7 2025-05-07T20:23:12.2623535Z cpu cores : 8 2025-05-07T20:23:12.2623731Z apicid : 15 2025-05-07T20:23:12.2623927Z initial apicid : 15 2025-05-07T20:23:12.2624141Z fpu : yes 2025-05-07T20:23:12.2624329Z fpu_exception : yes 2025-05-07T20:23:12.2624534Z cpuid level : 13 2025-05-07T20:23:12.2624728Z wp : yes 2025-05-07T20:23:12.2626678Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2628897Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2629387Z bogomips : 5599.99 2025-05-07T20:23:12.2629600Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2629833Z clflush size : 64 2025-05-07T20:23:12.2630044Z cache_alignment : 64 2025-05-07T20:23:12.2630308Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2630623Z power management: 2025-05-07T20:23:12.2630751Z 2025-05-07T20:23:12.2630756Z 2025-05-07T20:23:12.2630883Z ################################################################################ 2025-05-07T20:23:12.2631191Z [INFO] Print PCI info ... 2025-05-07T20:23:12.2631428Z + lspci -v 2025-05-07T20:23:12.2631547Z 2025-05-07T20:23:12.2631759Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.2632146Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.2632468Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.2632674Z 2025-05-07T20:23:12.2632872Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.2633251Z Physical Slot: 1 2025-05-07T20:23:12.2633496Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2633698Z 2025-05-07T20:23:12.2633951Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.2634380Z Physical Slot: 1 2025-05-07T20:23:12.2634638Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.2634859Z 2025-05-07T20:23:12.2635129Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.2635565Z Physical Slot: 3 2025-05-07T20:23:12.2635803Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2636141Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2636489Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.2636717Z 2025-05-07T20:23:12.2637017Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2637611Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.2637897Z Physical Slot: 4 2025-05-07T20:23:12.2638148Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.2638782Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2639156Z Capabilities: 2025-05-07T20:23:12.2639418Z Kernel driver in use: nvme 2025-05-07T20:23:12.2639585Z 2025-05-07T20:23:12.2639945Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2640429Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2640778Z Physical Slot: 5 2025-05-07T20:23:12.2641014Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2641365Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2641747Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2642068Z Capabilities: 2025-05-07T20:23:12.2642344Z Kernel driver in use: ena 2025-05-07T20:23:12.2642626Z Kernel modules: ena 2025-05-07T20:23:12.2642764Z 2025-05-07T20:23:12.2642931Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.2643312Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.2643699Z Physical Slot: 30 2025-05-07T20:23:12.2643952Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.2644323Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.2644717Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.2645086Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.2645413Z Capabilities: 2025-05-07T20:23:12.2645683Z Kernel driver in use: nvidia 2025-05-07T20:23:12.2645938Z Kernel modules: nvidia 2025-05-07T20:23:12.2646082Z 2025-05-07T20:23:12.2646382Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2646894Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.2647180Z Physical Slot: 31 2025-05-07T20:23:12.2647422Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2647773Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2648152Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.2648480Z Capabilities: 2025-05-07T20:23:12.2648740Z Kernel driver in use: nvme 2025-05-07T20:23:12.2648903Z 2025-05-07T20:23:12.2648907Z 2025-05-07T20:23:12.2649021Z ################################################################################ 2025-05-07T20:23:12.2649344Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.2649624Z + uname -a 2025-05-07T20:23:12.2649746Z 2025-05-07T20:23:12.2657397Z Linux ip-10-0-57-2.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.2657965Z 2025-05-07T20:23:12.2658057Z + uname -m 2025-05-07T20:23:12.2658191Z 2025-05-07T20:23:12.2658266Z x86_64 2025-05-07T20:23:12.2658382Z 2025-05-07T20:23:12.2658467Z + cat /proc/version 2025-05-07T20:23:12.2658597Z 2025-05-07T20:23:12.2659141Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.2659760Z 2025-05-07T20:23:12.2659847Z + cat /etc/os-release 2025-05-07T20:23:12.2659996Z 2025-05-07T20:23:12.2660081Z NAME="Amazon Linux" 2025-05-07T20:23:12.2660291Z VERSION="2023" 2025-05-07T20:23:12.2660494Z ID="amzn" 2025-05-07T20:23:12.2660680Z ID_LIKE="fedora" 2025-05-07T20:23:12.2660887Z VERSION_ID="2023" 2025-05-07T20:23:12.2661113Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.2661384Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.2661668Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.2661917Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.2662480Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.2662913Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.2663325Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.2663758Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.2664127Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.2664366Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.2664655Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.2664805Z 2025-05-07T20:23:12.2665005Z ################################################################################ 2025-05-07T20:23:12.2665308Z # Print EC2 Instance Info 2025-05-07T20:23:12.2665541Z # 2025-05-07T20:23:12.2665751Z # [2025-05-07T20:23:12.264Z] + print_ec2_info 2025-05-07T20:23:12.2666060Z ################################################################################ 2025-05-07T20:23:12.2666272Z 2025-05-07T20:23:12.2769654Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.2892038Z instance-id: i-03e120d7c73b3b069 2025-05-07T20:23:12.3012306Z instance-type: g5.4xlarge 2025-05-07T20:23:12.3054932Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.3055292Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.3064224Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.3064590Z env: 2025-05-07T20:23:12.3064814Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.3065125Z BUILD_ENV: build_binary 2025-05-07T20:23:12.3065379Z BUILD_TARGET: genai 2025-05-07T20:23:12.3065615Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.3065871Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:12.3066135Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.3066450Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.3066780Z ##[endgroup] 2025-05-07T20:23:12.6431670Z ################################################################################ 2025-05-07T20:23:12.6432128Z [INFO] Printing general display info ... 2025-05-07T20:23:12.6460830Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.7607003Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.7617571Z /usr/bin/sudo 2025-05-07T20:23:12.7628258Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.7638051Z /usr/bin/yum 2025-05-07T20:23:12.7640251Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.7660732Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.2122012Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.2875822Z ================================================================================ 2025-05-07T20:23:13.2876302Z WARNING: 2025-05-07T20:23:13.2876694Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.2877024Z 2025-05-07T20:23:13.2877149Z Available Versions: 2025-05-07T20:23:13.2877360Z 2025-05-07T20:23:13.2877484Z Version 2023.7.20250331: 2025-05-07T20:23:13.2877849Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.2878116Z 2025-05-07T20:23:13.2878251Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.2878467Z 2025-05-07T20:23:13.2878564Z Release notes: 2025-05-07T20:23:13.2878969Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.2879340Z 2025-05-07T20:23:13.2879438Z Version 2023.7.20250414: 2025-05-07T20:23:13.2879742Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.2879995Z 2025-05-07T20:23:13.2880110Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.2880318Z 2025-05-07T20:23:13.2880408Z Release notes: 2025-05-07T20:23:13.2880796Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.2881167Z 2025-05-07T20:23:13.2881254Z Version 2023.7.20250428: 2025-05-07T20:23:13.2881558Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.2881805Z 2025-05-07T20:23:13.2882149Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.2882365Z 2025-05-07T20:23:13.2882451Z Release notes: 2025-05-07T20:23:13.2882842Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.2883204Z 2025-05-07T20:23:13.2883322Z ================================================================================ 2025-05-07T20:23:13.4036255Z Dependencies resolved. 
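The `[EXEC] [ATTEMPT 0/3]` prefix above suggests a retry wrapper around flaky commands such as the network probe and `yum update`. A plausible sketch of such a helper; the real implementation lives in setup_env.bash and is not shown in this log, and both the function name and the backoff policy here are assumptions:

```bash
# Hypothetical retry helper matching the "[EXEC] [ATTEMPT n/3]" log lines.
exec_with_retries () {
  local max=3 attempt
  for attempt in $(seq 0 "$max"); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
    if "$@"; then
      return 0                    # success: stop retrying
    fi
    sleep $(( 2 ** attempt ))     # exponential backoff (assumed)
  done
  echo "[EXEC] Failed after $(( max + 1 )) attempts: $*" >&2
  return 1
}

exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null
exec_with_retries sudo yum update -y
```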
2025-05-07T20:23:13.4324823Z ================================================================================ 2025-05-07T20:23:13.4325395Z Package Arch Version Repository Size 2025-05-07T20:23:13.4325921Z ================================================================================ 2025-05-07T20:23:13.4326345Z Upgrading: 2025-05-07T20:23:13.4326698Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.4327286Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.4327665Z 2025-05-07T20:23:13.4328072Z Transaction Summary 2025-05-07T20:23:13.4328377Z ================================================================================ 2025-05-07T20:23:13.4328807Z Upgrade 2 Packages 2025-05-07T20:23:13.4329001Z 2025-05-07T20:23:13.4329141Z Total download size: 6.9 M 2025-05-07T20:23:13.4329641Z Downloading Packages: 2025-05-07T20:23:13.4710433Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 34 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.5193782Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 67 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.5207654Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.5208579Z Total 79 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.5211031Z Running transaction check 2025-05-07T20:23:13.5305623Z Transaction check succeeded. 2025-05-07T20:23:13.5306149Z Running transaction test 2025-05-07T20:23:13.5601544Z Transaction test succeeded. 2025-05-07T20:23:13.5605269Z Running transaction 2025-05-07T20:23:14.1120392Z Preparing : 1/1 2025-05-07T20:23:14.2190603Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.2220910Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.2438314Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.2439453Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.2553897Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.2577340Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.4139515Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.4140148Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.4140724Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.4141255Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:14.5596054Z ================================================================================ 2025-05-07T20:23:14.5596423Z WARNING: 2025-05-07T20:23:14.5596672Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:14.5596898Z 2025-05-07T20:23:14.5596987Z Available Versions: 2025-05-07T20:23:14.5597140Z 2025-05-07T20:23:14.5597229Z Version 2023.7.20250331: 2025-05-07T20:23:14.5597537Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:14.5597784Z 2025-05-07T20:23:14.5597912Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:14.5598121Z 2025-05-07T20:23:14.5598205Z Release notes: 2025-05-07T20:23:14.5598621Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:14.5599267Z 2025-05-07T20:23:14.5599374Z Version 2023.7.20250414: 2025-05-07T20:23:14.5599679Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:14.5599926Z 2025-05-07T20:23:14.5600039Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:14.5600249Z 2025-05-07T20:23:14.5600333Z Release notes: 2025-05-07T20:23:14.5600724Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:14.5601085Z 2025-05-07T20:23:14.5601175Z Version 2023.7.20250428: 2025-05-07T20:23:14.5601481Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:14.5601731Z 2025-05-07T20:23:14.5601843Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:14.5602046Z 2025-05-07T20:23:14.5602137Z Release notes: 2025-05-07T20:23:14.5602518Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:14.5602892Z 2025-05-07T20:23:14.5603228Z ================================================================================ 2025-05-07T20:23:14.6174219Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.6174554Z 2025-05-07T20:23:14.6174647Z Upgraded: 2025-05-07T20:23:14.6174985Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.6175549Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.6175883Z 2025-05-07T20:23:14.6175975Z Complete! 2025-05-07T20:23:14.6625305Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.6651509Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.1322888Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.1568430Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.1965821Z Dependencies resolved. 
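The earlier `which: no apt-get ...` / `/usr/bin/yum` lines indicate the installer probes for a package manager before installing `hostname` and `lshw`. A hedged sketch of that fallback logic; the helper name is illustrative:

```bash
# Probe apt-get first, fall back to yum, as the "which" lines suggest.
install_system_packages () {
  if which apt-get > /dev/null 2>&1; then
    sudo apt-get update -y && sudo apt-get install -y "$@"
  elif which yum > /dev/null 2>&1; then
    sudo yum install -y "$@"
  else
    echo "No supported package manager found" >&2
    return 1
  fi
}

install_system_packages hostname lshw
```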
2025-05-07T20:23:15.2142452Z ================================================================================ 2025-05-07T20:23:15.2143008Z Package Architecture Version Repository Size 2025-05-07T20:23:15.2143431Z ================================================================================ 2025-05-07T20:23:15.2143734Z Installing: 2025-05-07T20:23:15.2144021Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.2144298Z 2025-05-07T20:23:15.2144389Z Transaction Summary 2025-05-07T20:23:15.2144675Z ================================================================================ 2025-05-07T20:23:15.2145101Z Install 1 Package 2025-05-07T20:23:15.2145304Z 2025-05-07T20:23:15.2145428Z Total download size: 319 k 2025-05-07T20:23:15.2145686Z Installed size: 837 k 2025-05-07T20:23:15.2146735Z Downloading Packages: 2025-05-07T20:23:15.2950459Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.4 MB/s | 319 kB 00:00 2025-05-07T20:23:15.2956319Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.2959119Z Total 3.9 MB/s | 319 kB 00:00 2025-05-07T20:23:15.3119611Z Running transaction check 2025-05-07T20:23:15.3174254Z Transaction check succeeded. 2025-05-07T20:23:15.3174894Z Running transaction test 2025-05-07T20:23:15.3635303Z Transaction test succeeded. 2025-05-07T20:23:15.3639327Z Running transaction 2025-05-07T20:23:15.4641196Z Preparing : 1/1 2025-05-07T20:23:15.5121157Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.6733490Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.7969785Z ================================================================================ 2025-05-07T20:23:15.7970142Z WARNING: 2025-05-07T20:23:15.7970384Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:15.7970896Z 2025-05-07T20:23:15.7970996Z Available Versions: 2025-05-07T20:23:15.7971158Z 2025-05-07T20:23:15.7971246Z Version 2023.7.20250331: 2025-05-07T20:23:15.7971554Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:15.7971807Z 2025-05-07T20:23:15.7971927Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:15.7972133Z 2025-05-07T20:23:15.7972223Z Release notes: 2025-05-07T20:23:15.7972622Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:15.7972993Z 2025-05-07T20:23:15.7973080Z Version 2023.7.20250414: 2025-05-07T20:23:15.7973382Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:15.7973624Z 2025-05-07T20:23:15.7973744Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:15.7973947Z 2025-05-07T20:23:15.7974032Z Release notes: 2025-05-07T20:23:15.7974420Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:15.7974788Z 2025-05-07T20:23:15.7975058Z Version 2023.7.20250428: 2025-05-07T20:23:15.7975362Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:15.7975612Z 2025-05-07T20:23:15.7975723Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:15.7975934Z 2025-05-07T20:23:15.7976017Z Release notes: 2025-05-07T20:23:15.7976407Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:15.7976766Z 2025-05-07T20:23:15.7976883Z ================================================================================ 2025-05-07T20:23:15.8315810Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.8316301Z 2025-05-07T20:23:15.8316422Z Installed: 2025-05-07T20:23:15.8316863Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.8317286Z 2025-05-07T20:23:15.8317408Z Complete! 2025-05-07T20:23:15.8762769Z + hostname 2025-05-07T20:23:15.8762969Z 2025-05-07T20:23:15.8776606Z ip-10-0-57-2.ec2.internal 2025-05-07T20:23:15.8778180Z 2025-05-07T20:23:15.8778449Z + sudo lshw -C display 2025-05-07T20:23:15.8778607Z 2025-05-07T20:23:16.4467648Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.4468106Z description: VGA compatible controller 2025-05-07T20:23:16.4468564Z product: Amazon.com, Inc. 2025-05-07T20:23:16.4468947Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:16.4469294Z physical id: 3 2025-05-07T20:23:16.4469590Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.4469852Z version: 00 2025-05-07T20:23:16.4470063Z width: 32 bits 2025-05-07T20:23:16.4470292Z clock: 33MHz 2025-05-07T20:23:16.4470547Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.4470859Z configuration: latency=0 2025-05-07T20:23:16.4471188Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.4471528Z *-display:1 2025-05-07T20:23:16.4471757Z description: 3D controller 2025-05-07T20:23:16.4472062Z product: GA102GL [A10G] 2025-05-07T20:23:16.4472346Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.4472613Z physical id: 1e 2025-05-07T20:23:16.4472845Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.4473107Z version: a1 2025-05-07T20:23:16.4473324Z width: 64 bits 2025-05-07T20:23:16.4473553Z clock: 33MHz 2025-05-07T20:23:16.4473885Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.4474265Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.4474882Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.4508293Z 2025-05-07T20:23:16.4508733Z ################################################################################ 2025-05-07T20:23:16.4509200Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.4637824Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.4808370Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.4808899Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4809593Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.4810138Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.4810634Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.4811162Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.4811586Z | | | MIG M. | 2025-05-07T20:23:16.4811923Z |=========================================+========================+======================| 2025-05-07T20:23:16.4887336Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.4888228Z | 0% 31C P0 57W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.4888771Z | | | N/A | 2025-05-07T20:23:16.4889345Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.4889900Z 2025-05-07T20:23:16.4890321Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4890741Z | Processes: | 2025-05-07T20:23:16.4891181Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.4891594Z | ID ID Usage | 2025-05-07T20:23:16.4891999Z |=========================================================================================| 2025-05-07T20:23:16.4892602Z | No running processes found | 2025-05-07T20:23:16.4893255Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.6347745Z ################################################################################ 2025-05-07T20:23:16.6348217Z [INFO] Printing AMD GPU info ... 
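Taken together, the NVIDIA section above and the ROCm probe that follows imply a per-vendor check inside `print_gpu_info`: report the GPU through `lspci` and `nvidia-smi` when the NVIDIA stack is present, and separately probe for the ROCm tools (absent on this CUDA runner). A sketch under that assumption, with a hypothetical function name:

```bash
# Vendor probes inferred from the NVIDIA/AMD sections of this log.
print_gpu_info_sketch () {
  if which nvidia-smi > /dev/null 2>&1; then
    lspci | grep -i nvidia      # PCI view of the device
    nvidia-smi                  # driver / CUDA version, utilization
  fi
  if which rocminfo > /dev/null 2>&1; then
    rocminfo
  else
    echo "[CHECK] rocminfo not found"
  fi
}
```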
2025-05-07T20:23:16.6488217Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6488997Z [CHECK] rocminfo not found 2025-05-07T20:23:16.6498177Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6499176Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.6563583Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6564022Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6576450Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.6576810Z env: 2025-05-07T20:23:16.6577031Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.6577322Z BUILD_ENV: build_binary 2025-05-07T20:23:16.6577566Z BUILD_TARGET: genai 2025-05-07T20:23:16.6577790Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.6578016Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.6578273Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.6578571Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.6578897Z ##[endgroup] 2025-05-07T20:23:16.9946935Z ################################################################################ 2025-05-07T20:23:16.9947298Z # Setup Miniconda 2025-05-07T20:23:16.9947527Z # 2025-05-07T20:23:16.9961386Z # [2025-05-07T20:23:16.995Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:16.9961799Z ################################################################################ 2025-05-07T20:23:16.9962013Z 2025-05-07T20:23:16.9975983Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.0896867Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.0897226Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.0897422Z 2025-05-07T20:23:17.0914462Z 2025-05-07T20:23:17.0914792Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.0938119Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.1029006Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.1029387Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.1029640Z 2025-05-07T20:23:18.1178403Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.5671651Z Unpacking payload ... 2025-05-07T20:23:19.0885062Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:19.9248269Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.0254993Z 2025-05-07T20:23:22.0255652Z Installing base environment... 2025-05-07T20:23:22.0255917Z 2025-05-07T20:23:23.1070675Z Preparing transaction: ...working... done 2025-05-07T20:23:26.1181783Z Executing transaction: ...working... done 2025-05-07T20:23:26.7808005Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:26.8695862Z installation finished. 2025-05-07T20:23:26.8702575Z 2025-05-07T20:23:26.8703033Z + rm -f miniconda.sh 2025-05-07T20:23:26.8703297Z 2025-05-07T20:23:26.9016083Z 2025-05-07T20:23:26.9016492Z [SETUP] Reloading the bash configuration ... 
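[NOTE] The Miniconda setup above is a standard non-interactive (batch) install; a minimal sketch of the same flow, assuming the latest Linux x86_64 installer and a prefix of $HOME/miniconda (both as shown in this log). The conda init and bashrc reload correspond to the output that follows:

  prefix="$HOME/miniconda"
  mkdir -p "$prefix"
  # Fetch the installer quietly; it is roughly 100 MB
  wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  # -b: batch mode (no prompts), -p: install prefix, -u: update an existing install in place
  bash miniconda.sh -b -p "$prefix" -u
  rm -f miniconda.sh
  # Register conda in ~/.bashrc, then reload so the current shell picks it up
  "$prefix/bin/conda" init bash
  . ~/.bashrc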
2025-05-07T20:23:26.9016952Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.2682949Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.2684239Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.2685203Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.2686189Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.2687039Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.2687426Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.2687861Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.2688302Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.2688754Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.2689589Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.2690121Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.2690495Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:27.2690881Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.3364525Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.1791542Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.1816366Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.5555393Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.1649092Z Solving environment: done
2025-05-07T20:23:43.2616396Z ## Package Plan ##
2025-05-07T20:23:43.2616722Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.2617062Z added / updated specs:
2025-05-07T20:23:43.2617325Z - conda-libmamba-solver
2025-05-07T20:23:43.2617583Z - libarchive
2025-05-07T20:23:43.2617794Z - libmamba
2025-05-07T20:23:43.2617993Z - libmambapy
2025-05-07T20:23:43.2618257Z The following packages will be downloaded:
2025-05-07T20:23:43.2618595Z package | build
2025-05-07T20:23:43.2618907Z ---------------------------|-----------------
2025-05-07T20:23:43.2619322Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.2619795Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.2620228Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.2620696Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.2621146Z ------------------------------------------------------------
2025-05-07T20:23:43.2621489Z Total: 1.4 MB
2025-05-07T20:23:43.2621811Z The following packages will be UPDATED:
2025-05-07T20:23:43.2625673Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.2626456Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.2627067Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.2627714Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.2628509Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.2629149Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:43.5952119Z Preparing transaction: done
2025-05-07T20:23:43.6958115Z Verifying transaction: done
2025-05-07T20:23:45.0978401Z Executing transaction: done
2025-05-07T20:23:46.9498418Z [SETUP] Updating Miniconda base packages ...
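[NOTE] Each "[EXEC] [ATTEMPT 0/3]" line in this log comes from a retry wrapper in .github/scripts/setup_env.bash, whose source is not shown here. A hypothetical equivalent, assuming a name exec_with_retries, zero-based attempt numbering, and a fixed delay (all assumptions, not confirmed by the log):

  exec_with_retries () {
    local max=3 delay=5 attempt
    for attempt in $(seq 0 "$max"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
      "$@" && return 0        # stop at the first success
      sleep "$delay"          # back off before retrying
    done
    echo "[EXEC] command failed after ${max} retries: $*" >&2
    return 1
  }

  # e.g. the base update that follows next in this log:
  exec_with_retries conda update -n base -c defaults --update-deps -y conda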
2025-05-07T20:23:46.9524129Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.8897304Z Channels:
2025-05-07T20:23:47.8897551Z - defaults
2025-05-07T20:23:47.8898035Z Platform: linux-64
2025-05-07T20:23:49.1364225Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.2538179Z Solving environment: done
2025-05-07T20:23:49.5510753Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.7666579Z Solving environment: done
2025-05-07T20:23:49.9161946Z ## Package Plan ##
2025-05-07T20:23:49.9162326Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.9162810Z added / updated specs:
2025-05-07T20:23:49.9163137Z - conda
2025-05-07T20:23:49.9163606Z The following packages will be downloaded:
2025-05-07T20:23:49.9164045Z package | build
2025-05-07T20:23:49.9164473Z ---------------------------|-----------------
2025-05-07T20:23:49.9164883Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:49.9165383Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:49.9165936Z ------------------------------------------------------------
2025-05-07T20:23:49.9166755Z Total: 1.4 MB
2025-05-07T20:23:49.9167120Z The following packages will be UPDATED:
2025-05-07T20:23:49.9167636Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.9168162Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.9168575Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:50.2868904Z Preparing transaction: done
2025-05-07T20:23:50.3874928Z Verifying transaction: done
2025-05-07T20:23:52.5965187Z Executing transaction: done
2025-05-07T20:23:53.2259495Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.2263352Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.2231088Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.2231447Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.2910318Z + conda clean --all -y
2025-05-07T20:23:54.8314164Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.8314513Z Will remove 1 index cache(s).
2025-05-07T20:23:54.8314829Z There are no unused package(s) to remove.
2025-05-07T20:23:54.8315178Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.8315462Z There are no logfile(s) to remove. 2025-05-07T20:23:54.9004786Z 2025-05-07T20:23:54.9009865Z + conda info 2025-05-07T20:23:54.9010050Z 2025-05-07T20:23:55.6807975Z 2025-05-07T20:23:55.6808573Z active environment : base 2025-05-07T20:23:55.6808945Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.6809305Z shell level : 1 2025-05-07T20:23:55.6809593Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.6809987Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.6810415Z conda version : 25.3.1 2025-05-07T20:23:55.6810715Z conda-build version : not installed 2025-05-07T20:23:55.6811018Z python version : 3.13.2.final.0 2025-05-07T20:23:55.6811349Z solver : libmamba (default) 2025-05-07T20:23:55.6811670Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.6811968Z __conda=25.3.1=0 2025-05-07T20:23:55.6812253Z __cuda=12.8=0 2025-05-07T20:23:55.6812530Z __glibc=2.34=0 2025-05-07T20:23:55.6812805Z __linux=6.1.130=0 2025-05-07T20:23:55.6813089Z __unix=0=0 2025-05-07T20:23:55.6813445Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.6813855Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.6814218Z conda av metadata url : None 2025-05-07T20:23:55.6815036Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.6815482Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.6815875Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.6816282Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.6816661Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.6817007Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.6817362Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.6817714Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.6818031Z platform : linux-64 2025-05-07T20:23:55.6818873Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.6819706Z UID:GID : 1000:1000 2025-05-07T20:23:55.6820005Z netrc file : None 2025-05-07T20:23:55.6820268Z offline mode : False 2025-05-07T20:23:55.6820455Z 2025-05-07T20:23:55.7619215Z 2025-05-07T20:23:55.7619636Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.7621051Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_91bf4d2a-ac78-4419-8bdb-b54df260a85a ... 2025-05-07T20:23:55.7622156Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.7707503Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7707995Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7725310Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.7725667Z env: 2025-05-07T20:23:55.7725895Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.7726199Z BUILD_ENV: build_binary 2025-05-07T20:23:55.7726477Z BUILD_TARGET: genai 2025-05-07T20:23:55.7726720Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.7726954Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.7727208Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.7727511Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.7727854Z ##[endgroup] 2025-05-07T20:23:56.1103681Z ################################################################################ 2025-05-07T20:23:56.1104044Z # Create Conda Environment 2025-05-07T20:23:56.1104291Z # 2025-05-07T20:23:56.1121005Z # [2025-05-07T20:23:56.111Z] + create_conda_environment build_binary 3.11 2025-05-07T20:23:56.1121482Z ################################################################################ 2025-05-07T20:23:56.1121693Z 2025-05-07T20:23:56.1137782Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.2037029Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.2037396Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.2037703Z + conda info --envs 2025-05-07T20:23:56.2037845Z 2025-05-07T20:23:56.9792711Z 2025-05-07T20:23:56.9793256Z # conda environments: 2025-05-07T20:23:56.9793520Z # 2025-05-07T20:23:56.9793739Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.9794002Z 2025-05-07T20:23:57.0514319Z 2025-05-07T20:23:57.0514874Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.7186595Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.7186881Z 2025-05-07T20:23:58.7203032Z 2025-05-07T20:23:58.7212478Z [SETUP] Creating new Conda environment (Python 3.11) ... 
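[NOTE] The environment creation step shown here is idempotent: any stale prefix is deleted first, then the environment is recreated with a pinned interpreter, as the conda create command below shows. A minimal sketch under those assumptions; the helper name create_conda_environment matches the invocation echoed above, but its body here is a guess, not the actual setup_env.bash source:

  create_conda_environment () {
    local env_name="$1" python_version="$2"
    # Remove any stale prefix so the create starts clean
    rm -rf "$(conda info --base)/envs/${env_name}"
    # Pin only the Python version; everything else resolves freely
    conda create -y -n "${env_name}" python="${python_version}"
  }

  create_conda_environment build_binary 3.11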
2025-05-07T20:23:58.7234619Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.11
2025-05-07T20:23:59.5024475Z Channels:
2025-05-07T20:23:59.5024786Z - defaults
2025-05-07T20:23:59.5025072Z Platform: linux-64
2025-05-07T20:24:01.0573880Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:01.1579444Z Solving environment: done
2025-05-07T20:24:01.1870099Z ## Package Plan ##
2025-05-07T20:24:01.1870704Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.1871303Z added / updated specs:
2025-05-07T20:24:01.1871644Z - python=3.11
2025-05-07T20:24:01.1871908Z The following packages will be downloaded:
2025-05-07T20:24:01.1872259Z package | build
2025-05-07T20:24:01.1872587Z ---------------------------|-----------------
2025-05-07T20:24:01.1872947Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:01.1873342Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:01.1873903Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:01.1874479Z python-3.11.11 | he870216_0 32.9 MB
2025-05-07T20:24:01.1875003Z setuptools-78.1.1 | py311h06a4308_0 2.3 MB
2025-05-07T20:24:01.1875417Z wheel-0.45.1 | py311h06a4308_0 151 KB
2025-05-07T20:24:01.1875782Z ------------------------------------------------------------
2025-05-07T20:24:01.1876450Z Total: 35.4 MB
2025-05-07T20:24:01.1876786Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.1877412Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.1877855Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.1878331Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:01.1878804Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.1879344Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.1879803Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.1880233Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.1880763Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.1881422Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.1882048Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:01.1882477Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.1882888Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.1883297Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.1883838Z python pkgs/main/linux-64::python-3.11.11-he870216_0
2025-05-07T20:24:01.1884263Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.1884735Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py311h06a4308_0
2025-05-07T20:24:01.1885200Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.1885587Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.1885971Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.1886388Z wheel pkgs/main/linux-64::wheel-0.45.1-py311h06a4308_0
2025-05-07T20:24:01.1886783Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.1887159Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.1887558Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:02.5788301Z Preparing transaction: done
2025-05-07T20:24:03.8266799Z Verifying transaction: done
2025-05-07T20:24:06.1481742Z Executing transaction: done
2025-05-07T20:24:06.1985649Z #
2025-05-07T20:24:06.1986133Z # To activate this environment, use
2025-05-07T20:24:06.1986697Z #
2025-05-07T20:24:06.1987088Z #     $ conda activate build_binary
2025-05-07T20:24:06.1987606Z #
2025-05-07T20:24:06.1988025Z # To deactivate an active environment, use
2025-05-07T20:24:06.1988591Z #
2025-05-07T20:24:06.1988963Z #     $ conda deactivate
2025-05-07T20:24:06.3172848Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.3196105Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.2268904Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (25.1)
2025-05-07T20:24:09.2269503Z Collecting pip
2025-05-07T20:24:09.2269832Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.2270247Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.2271106Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 126.9 MB/s eta 0:00:00
2025-05-07T20:24:09.2271475Z Installing collected packages: pip
2025-05-07T20:24:09.2271766Z Attempting uninstall: pip
2025-05-07T20:24:09.2272061Z Found existing installation: pip 25.1
2025-05-07T20:24:09.2272375Z Uninstalling pip-25.1:
2025-05-07T20:24:09.2272654Z Successfully uninstalled pip-25.1
2025-05-07T20:24:09.2272982Z Successfully installed pip-25.1.1
2025-05-07T20:24:09.2942674Z [SETUP] Upgrading pyOpenSSL ...
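[NOTE] Upgrading pip via `conda run -n build_binary pip install --upgrade pip` (above) keeps the upgrade inside the environment without activating it in the CI shell; the pyOpenSSL upgrade whose command follows pins a lower bound rather than an exact version. A sketch of both, under the same environment name as this log:

  # Run pip inside the env; no `conda activate` is needed in the CI shell
  conda run -n build_binary pip install --upgrade pip
  # Quote the spec so `>` is not treated as a shell redirection
  conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"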
2025-05-07T20:24:09.2965437Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:10.1585252Z Channels:
2025-05-07T20:24:10.1585786Z - conda-forge
2025-05-07T20:24:10.1586234Z Platform: linux-64
2025-05-07T20:24:20.7473163Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:22.4359261Z Solving environment: done
2025-05-07T20:24:22.5007821Z ## Package Plan ##
2025-05-07T20:24:22.5008324Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:22.5008778Z added / updated specs:
2025-05-07T20:24:22.5009162Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:22.5009637Z The following packages will be downloaded:
2025-05-07T20:24:22.5010079Z package | build
2025-05-07T20:24:22.5010463Z ---------------------------|-----------------
2025-05-07T20:24:22.5010838Z cffi-1.17.1 | py311hf29c0ef_0 295 KB conda-forge
2025-05-07T20:24:22.5011291Z cryptography-44.0.3 | py311hafd3f86_0 1.5 MB conda-forge
2025-05-07T20:24:22.5011727Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:22.5012146Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:22.5012563Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:22.5012976Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:22.5013390Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:22.5013835Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:22.5014267Z python_abi-3.11 | 2_cp311 5 KB conda-forge
2025-05-07T20:24:22.5014720Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:22.5015209Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:22.5015633Z ------------------------------------------------------------
2025-05-07T20:24:22.5015977Z Total: 6.4 MB
2025-05-07T20:24:22.5016315Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:22.5016746Z cffi conda-forge/linux-64::cffi-1.17.1-py311hf29c0ef_0
2025-05-07T20:24:22.5017244Z cryptography conda-forge/linux-64::cryptography-44.0.3-py311hafd3f86_0
2025-05-07T20:24:22.5017743Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:22.5018508Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:22.5018984Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:22.5019595Z python_abi conda-forge/linux-64::python_abi-3.11-2_cp311
2025-05-07T20:24:22.5020314Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:22.5020900Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:22.5021356Z The following packages will be UPDATED:
2025-05-07T20:24:22.5021951Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:22.5022718Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:22.5023368Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:22.5024009Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 -->
conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:22.5024532Z Downloading and Extracting Packages: ...working... done
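[NOTE] Once the install completes below, the script import-tests pyOpenSSL, then installs libxcrypt and copies crypt.h into the Python include directory (crypt.h was dropped from recent glibc toolchains, while CPython 3.11 headers can still expect it). A minimal sketch of those checks, assuming the environment layout shown in this log:

  # Verify the OpenSSL package is importable inside the env
  conda run -n build_binary python -c "import OpenSSL" \
    && echo "[CHECK] Python (sub-)package 'OpenSSL' found ..."
  # Provide crypt.h via libxcrypt, then expose it to the Python headers
  conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
  env_prefix="$HOME/miniconda/envs/build_binary"
  cp "${env_prefix}/include/crypt.h" "${env_prefix}/include/python3.11/crypt.h"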
2025-05-07T20:24:23.1404090Z Preparing transaction: done
2025-05-07T20:24:23.2409227Z Verifying transaction: done
2025-05-07T20:24:24.7437088Z Executing transaction: done
2025-05-07T20:24:24.9207573Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.6752709Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.6766186Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.6790206Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:27.5484048Z Channels:
2025-05-07T20:24:27.5484370Z - conda-forge
2025-05-07T20:24:27.5484695Z Platform: linux-64
2025-05-07T20:24:30.9328510Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.3073043Z Solving environment: done
2025-05-07T20:24:31.3687510Z ## Package Plan ##
2025-05-07T20:24:31.3688019Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:31.3688567Z added / updated specs:
2025-05-07T20:24:31.3688927Z - libxcrypt
2025-05-07T20:24:31.3689185Z The following packages will be downloaded:
2025-05-07T20:24:31.3689531Z package | build
2025-05-07T20:24:31.3689862Z ---------------------------|-----------------
2025-05-07T20:24:31.3690242Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:31.3690647Z ------------------------------------------------------------
2025-05-07T20:24:31.3690994Z Total: 98 KB
2025-05-07T20:24:31.3691360Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:31.3691818Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:31.3692259Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:31.6536294Z Preparing transaction: done
2025-05-07T20:24:31.7537880Z Verifying transaction: done
2025-05-07T20:24:31.8543338Z Executing transaction: done
2025-05-07T20:24:35.3283068Z [SETUP] Copying over ...
2025-05-07T20:24:35.3283932Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.11/crypt.h
2025-05-07T20:24:37.0193365Z [SETUP] Installed Python version: Python 3.11.11
2025-05-07T20:24:37.0193824Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:37.0227792Z ##[group]Run . 
$PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.0228250Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.0241023Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:37.0241369Z env:
2025-05-07T20:24:37.0241592Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:37.0241881Z BUILD_ENV: build_binary
2025-05-07T20:24:37.0242122Z BUILD_TARGET: genai
2025-05-07T20:24:37.0242347Z BUILD_VARIANT: cuda
2025-05-07T20:24:37.0242573Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:37.0243014Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:37.0243307Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:37.0243840Z ##[endgroup]
2025-05-07T20:24:37.3641578Z ################################################################################
2025-05-07T20:24:37.3641974Z # Install C/C++ Compilers
2025-05-07T20:24:37.3642224Z #
2025-05-07T20:24:37.3658600Z # [2025-05-07T20:24:37.365Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:37.3659008Z ################################################################################
2025-05-07T20:24:37.3673728Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:37.4559871Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:37.4569342Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:37.4589788Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:38.3268373Z Channels:
2025-05-07T20:24:38.3268635Z - conda-forge
2025-05-07T20:24:38.3268853Z Platform: linux-64
2025-05-07T20:24:41.7230868Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.0949865Z Solving environment: done
2025-05-07T20:24:42.1568787Z ## Package Plan ##
2025-05-07T20:24:42.1569189Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.1569582Z added / updated specs:
2025-05-07T20:24:42.1569844Z - sysroot_linux-64=2.17
2025-05-07T20:24:42.1570133Z The following packages will be downloaded:
2025-05-07T20:24:42.1570464Z package | build
2025-05-07T20:24:42.1570775Z ---------------------------|-----------------
2025-05-07T20:24:42.1571189Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:24:42.1571687Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:24:42.1572087Z ------------------------------------------------------------
2025-05-07T20:24:42.1572424Z Total: 15.4 MB
2025-05-07T20:24:42.1572761Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.1573270Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:42.1573819Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:42.1574282Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:43.2790118Z done
2025-05-07T20:24:43.3793090Z Preparing transaction: done
2025-05-07T20:24:43.5797243Z Verifying transaction: done
2025-05-07T20:24:43.7854591Z Executing transaction: done
2025-05-07T20:24:43.9499720Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:43.9500035Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:45.6498674Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:45.6514098Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:45.6537722Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:46.5443532Z Channels:
2025-05-07T20:24:46.5443940Z - conda-forge
2025-05-07T20:24:46.5444236Z Platform: linux-64
2025-05-07T20:24:49.9082362Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:50.8899798Z Solving environment: done
2025-05-07T20:24:50.9550573Z ## Package Plan ##
2025-05-07T20:24:50.9551078Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:50.9551623Z added / updated specs:
2025-05-07T20:24:50.9551880Z - gxx_linux-64=11.4.0
2025-05-07T20:24:50.9552250Z The following packages will be downloaded:
2025-05-07T20:24:50.9552703Z package | build
2025-05-07T20:24:50.9553143Z ---------------------------|-----------------
2025-05-07T20:24:50.9553687Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:24:50.9554217Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:24:50.9554876Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:24:50.9555380Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:24:50.9555822Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:24:50.9556258Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:24:50.9556691Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:24:50.9557151Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:24:50.9557617Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:24:50.9558056Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:50.9558527Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:50.9559010Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T20:24:50.9559420Z ------------------------------------------------------------ 2025-05-07T20:24:50.9559915Z Total: 91.6 MB 2025-05-07T20:24:50.9560202Z 2025-05-07T20:24:50.9560385Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:50.9561010Z 2025-05-07T20:24:50.9561309Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:50.9561873Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:50.9562418Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:50.9562941Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:50.9563450Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:50.9564116Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:50.9564869Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.9565438Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:50.9565934Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:50.9566489Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.9566856Z 2025-05-07T20:24:50.9566973Z The following packages will be UPDATED: 2025-05-07T20:24:50.9567177Z 2025-05-07T20:24:50.9567497Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:50.9568214Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:50.9568620Z 2025-05-07T20:24:50.9568625Z 2025-05-07T20:24:50.9568629Z 2025-05-07T20:24:50.9568773Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:50.9569165Z [progress-bar output elided: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng (34 KB), gcc_linux-64 (31 KB), gxx_linux-64 (29 KB), and binutils_linux-64 (28 KB) each downloaded to 100%]
2025-05-07T20:24:53.3149796Z done
2025-05-07T20:24:53.4150240Z Preparing transaction: done
2025-05-07T20:24:53.7160653Z Verifying transaction: done
2025-05-07T20:24:53.8170626Z Executing transaction: done
2025-05-07T20:24:53.9871636Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:57.9302634Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.9335138Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:57.9364641Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.9395110Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.8374513Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.9046725Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:01.8009971Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:01.8691808Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:03.7681949Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:03.8336229Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:05.7274706Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:05.7917812Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:05.7921786Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:05.7922437Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:05.7922782Z 2025-05-07T20:25:07.7021200Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.7021631Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.7022034Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.7022394Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.7022833Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.7023568Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.7023866Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.7024251Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.7024653Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.7024995Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.7025322Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.7025627Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.7025893Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.7026172Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.7026446Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.7026793Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7027471Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.7027836Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.7028168Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.7028493Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.7028902Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.7029324Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.7029643Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.7029933Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.7030181Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.7030472Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.7030747Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.7031026Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.7031364Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7031692Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.7031966Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.7032251Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.7032519Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.7032778Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.7033044Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.7033302Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.7033568Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.7033819Z #define __INT8_C(c) c 2025-05-07T20:25:07.7034059Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.7034358Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7034673Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.7034991Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.7035361Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.7035635Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.7035906Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7036185Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.7036456Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.7036923Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.7037339Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.7037630Z #define __linux 1 2025-05-07T20:25:07.7037859Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.7038139Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:07.7038736Z #define __unix 1 2025-05-07T20:25:07.7039037Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.7039377Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.7039654Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.7039929Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.7040230Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.7040509Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.7040775Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.7041029Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.7041315Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.7041629Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.7041891Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.7042195Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.7042465Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.7042809Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.7043187Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.7043786Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.7044054Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.7044285Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.7044592Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.7044939Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.7045192Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.7045518Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.7045852Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.7046107Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.7046373Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.7046882Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.7047276Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.7047554Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.7047814Z #define __unix__ 1 2025-05-07T20:25:07.7048031Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.7048278Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.7048529Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.7048773Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:07.7049044Z #define __UINT16_C(c) c 2025-05-07T20:25:07.7049283Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.7049536Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.7049894Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.7050255Z #define __gnu_linux__ 1 2025-05-07T20:25:07.7050492Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:07.7050769Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.7051053Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7051328Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.7051586Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.7051842Z #define __GNUC__ 11 2025-05-07T20:25:07.7052062Z #define __pie__ 2 2025-05-07T20:25:07.7052274Z #define __MMX__ 1 2025-05-07T20:25:07.7052498Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.7052766Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.7053053Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.7053326Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.7053669Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.7054064Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7054381Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.7054647Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.7054909Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.7055209Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.7055478Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.7055736Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.7056027Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.7056321Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.7056589Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.7056871Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.7057124Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.7057401Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:07.7057695Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.7057949Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.7058210Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.7058533Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.7058892Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.7067570Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.7067874Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.7068185Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7068476Z #define __amd64 1 2025-05-07T20:25:07.7068714Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.7068999Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.7069301Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.7069617Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.7069865Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:07.7070125Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:07.7070365Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:07.7070789Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:07.7071060Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:07.7071319Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:07.7071587Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:07.7071874Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.7072120Z #define __x86_64 1 2025-05-07T20:25:07.7072355Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:07.7072730Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:07.7073184Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:07.7073728Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:07.7074194Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.7074583Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:07.7074833Z #define __LP64__ 1 2025-05-07T20:25:07.7075067Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7075425Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:07.7075792Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:07.7076068Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.7076346Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.7076621Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:07.7076898Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:07.7077166Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:07.7077422Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:07.7077690Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:07.7077950Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.7078288Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:07.7078640Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:07.7078917Z #define __FLT_DIG__ 6 2025-05-07T20:25:07.7079155Z #define __NO_INLINE__ 1 2025-05-07T20:25:07.7079388Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:07.7079720Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:07.7080064Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:07.7080315Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:07.7080578Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:07.7080834Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:07.7081081Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:07.7081336Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:07.7081633Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:07.7081916Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:07.7082189Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:07.7082497Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.7082832Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:07.7083093Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:07.7083356Z #define __FLT128_DIG__ 33 2025-05-07T20:25:07.7083759Z #define __INT32_C(c) c 2025-05-07T20:25:07.7084015Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:07.7084294Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:07.7084578Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:07.7084849Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:07.7085166Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:07.7085476Z #define unix 1 2025-05-07T20:25:07.7085696Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:07.7086009Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7086313Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:07.7086613Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:07.7086944Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:07.7087198Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:07.7087468Z #define __ELF__ 1 2025-05-07T20:25:07.7087692Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:07.7087980Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:07.7088256Z #define __FLT_RADIX__ 2 2025-05-07T20:25:07.7088500Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:07.7088856Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:07.7089331Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:07.7089584Z #define __SSE_MATH__ 1 2025-05-07T20:25:07.7089812Z #define __k8 1 2025-05-07T20:25:07.7090110Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:07.7090477Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:07.7090770Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:07.7091070Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:07.7091322Z #define __LDBL_DIG__ 18 2025-05-07T20:25:07.7091565Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:07.7091817Z #define __x86_64__ 1 2025-05-07T20:25:07.7092198Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:07.7092488Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:07.7092821Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7093124Z #define __FLT64_DIG__ 15 2025-05-07T20:25:07.7093395Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7093747Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.7094064Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7094319Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:07.7094596Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7094893Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:07.7095245Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:07.7095638Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:07.7095931Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:07.7096268Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:07.7096583Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:07.7096893Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:07.7097176Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:07.7097473Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:07.7097761Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:07.7098001Z #define __SEG_FS 1 2025-05-07T20:25:07.7098226Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:07.7098512Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:07.7098791Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7099071Z #define __SEG_GS 1 2025-05-07T20:25:07.7099383Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:07.7099773Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:07.7100087Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:07.7100366Z #define __INT16_TYPE__ short int 2025-05-07T20:25:07.7100644Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:07.7100940Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:07.7101207Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:07.7101459Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:07.7101717Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:07.7102049Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.7102430Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7102720Z #define linux 1 2025-05-07T20:25:07.7102940Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7103216Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.7103489Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:07.7103730Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:07.7103993Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:07.7104256Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:07.7104600Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.7105001Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:07.7105330Z #define __code_model_small__ 1 2025-05-07T20:25:07.7105607Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:07.7105894Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:07.7106143Z #define __k8__ 1 2025-05-07T20:25:07.7106372Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:07.7106651Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:07.7106949Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:07.7107191Z #define __pic__ 2 2025-05-07T20:25:07.7107546Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7107858Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:07.7108148Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7108470Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:07.7108835Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.7109191Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.7109463Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:07.7109750Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.7110058Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:07.7110397Z #define __linux__ 1 2025-05-07T20:25:07.7110619Z #define __INT64_TYPE__ long int 2025-05-07T20:25:07.7110889Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:07.7111152Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:07.7111420Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:07.7111682Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:07.7111977Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7112310Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.7112610Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:07.7112886Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:07.7113185Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:07.7113471Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:07.7113806Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.7114166Z #define __SSE__ 1 2025-05-07T20:25:07.7114387Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:07.7114731Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.7115087Z #define __amd64__ 1 2025-05-07T20:25:07.7115305Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:07.7115559Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.7115833Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:07.7116093Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:07.7116362Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:07.7116642Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:07.7116896Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:07.7117166Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:07.7117431Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:07.7117775Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:07.7118229Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:07.7118585Z #define _LP64 1 2025-05-07T20:25:07.7118799Z #define __UINT8_C(c) c 2025-05-07T20:25:07.7119035Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:07.7119296Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:07.7119561Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:07.7119832Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:07.7120127Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:07.7120470Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.7120929Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.7121304Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7121588Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7121896Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:07.7122259Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:07.7122624Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:07.7122885Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:07.7123221Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:07.7123723Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:07.7124000Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:07.7124259Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:07.7124507Z #define __FXSR__ 1 2025-05-07T20:25:07.7124798Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.7125249Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.7125659Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.7126063Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:07.7126318Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:07.7126656Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:07.7127010Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:07.7127249Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:07.7127487Z #define __PIC__ 2 2025-05-07T20:25:07.7127737Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:07.7128125Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.7128509Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:07.7128938Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.7129263Z #define __SSE2__ 1 2025-05-07T20:25:07.7129485Z #define __INT32_TYPE__ int 2025-05-07T20:25:07.7129735Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:07.7129983Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.7130317Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:07.7130686Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:07.7130958Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:07.7131218Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:07.7131485Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7131771Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:07.7132009Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:07.7132270Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:07.7132557Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7132846Z #define __PIE__ 2 2025-05-07T20:25:07.7133176Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:07.7133572Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:07.7133905Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:07.7134269Z #define __INT16_C(c) c 2025-05-07T20:25:07.7134494Z #define __STDC__ 1 2025-05-07T20:25:07.7134716Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:07.7134998Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:07.7135254Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.7135549Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:07.7135888Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:07.7136227Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:07.7136489Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.7136760Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:07.7137024Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:07.7137307Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:07.7137588Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7137862Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:07.7138156Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7139134Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.7139509Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:07.7139807Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:07.7140104Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:07.7140351Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:07.7140516Z 2025-05-07T20:25:07.7668336Z 2025-05-07T20:25:07.7669028Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:07.7669504Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:07.7669735Z 2025-05-07T20:25:09.6736975Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:09.6737457Z #define __cpp_attributes 200809L 2025-05-07T20:25:09.6737913Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:09.6738596Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:09.6738926Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:09.6739178Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:09.6739511Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:09.6739985Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:09.6740410Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:09.6740840Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:09.6741648Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:09.6741925Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:09.6742169Z #define __CHAR_BIT__ 8 2025-05-07T20:25:09.6742409Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:09.6742653Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:09.6742900Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:09.6743175Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:09.6743450Z #define __cpp_static_assert 201411L 2025-05-07T20:25:09.6743734Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:09.6744123Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.6744574Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:09.6744862Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:09.6745178Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:09.6745494Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:09.6745889Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:09.6746297Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:09.6746605Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:09.6746881Z #define __GCC_IEC_559 2 2025-05-07T20:25:09.6747123Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:09.6747397Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:09.6747664Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:09.6747946Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:09.6748236Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:09.6748552Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:09.6748859Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:09.6749189Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.6749510Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:09.6749773Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.6750039Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:09.6750320Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:09.6750615Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:09.6750881Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:09.6751143Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:09.6751412Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:09.6751736Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:09.6752065Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:09.6752319Z #define __INT8_C(c) c 2025-05-07T20:25:09.6752550Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:09.6752817Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:09.6753139Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.6753463Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:09.6753742Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:09.6754031Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:09.6754343Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:09.6754689Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:09.6754975Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:09.6755259Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.6755515Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.6755792Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:09.6756068Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:09.6756455Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:09.6756865Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:09.6757154Z #define __linux 1 2025-05-07T20:25:09.6757381Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:09.6757655Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:09.6757936Z #define __unix 1 2025-05-07T20:25:09.6758163Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:09.6758436Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:09.6758724Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:09.6758993Z #define __WINT_MIN__ 0U 2025-05-07T20:25:09.6759228Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.6759510Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:09.6759871Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:09.6760134Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:09.6760386Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:09.6760701Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:09.6760997Z #define __INT64_C(c) c ## L 2025-05-07T20:25:09.6761263Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:09.6761563Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:09.6761835Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:09.6762127Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:09.6762401Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:09.6762743Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:09.6763082Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:09.6763457Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:09.6763864Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:09.6764132Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:09.6764406Z #define __DBL_DIG__ 15 2025-05-07T20:25:09.6764643Z #define __FLT32_DIG__ 6 2025-05-07T20:25:09.6764940Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:09.6765284Z #define __GXX_WEAK__ 1 2025-05-07T20:25:09.6765518Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:09.6765769Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:09.6766103Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:09.6766450Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.6766710Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:09.6767003Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:09.6767342Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:09.6767745Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:09.6768135Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:09.6768411Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:09.6768668Z #define __unix__ 1 2025-05-07T20:25:09.6768886Z #define __INT_WIDTH__ 32 2025-05-07T20:25:09.6769136Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:09.6769383Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:09.6769632Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:09.6769899Z #define __UINT16_C(c) c 2025-05-07T20:25:09.6770139Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:09.6770393Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:09.6770812Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:09.6771165Z #define __gnu_linux__ 1 2025-05-07T20:25:09.6771408Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:09.6771672Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:09.6771948Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.6772234Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.6772501Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:09.6772764Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:09.6773008Z #define __GNUC__ 11 2025-05-07T20:25:09.6773227Z #define __GXX_RTTI 1 2025-05-07T20:25:09.6773450Z #define __pie__ 2 2025-05-07T20:25:09.6781716Z #define __MMX__ 1 2025-05-07T20:25:09.6781964Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:09.6782243Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:09.6782531Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:09.6782800Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:09.6783056Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:09.6783364Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:09.6783678Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:09.6784028Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:09.6784400Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:09.6784710Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.6785030Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:09.6785296Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:09.6785556Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:09.6785868Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:09.6786170Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:09.6786552Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:09.6786811Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:09.6787096Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:09.6787388Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:09.6787651Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:09.6787929Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:09.6788181Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:09.6788435Z #define __cplusplus 201703L 2025-05-07T20:25:09.6788703Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:09.6788984Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:09.6789313Z #define __DEPRECATED 1 2025-05-07T20:25:09.6789568Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:09.6789861Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:09.6790111Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:09.6790429Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:09.6790787Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:09.6791062Z #define __SSE2_MATH__ 1 2025-05-07T20:25:09.6791302Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:09.6791601Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.6791893Z #define __amd64 1 2025-05-07T20:25:09.6792107Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:09.6792373Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:09.6792634Z #define __GNUG__ 11 2025-05-07T20:25:09.6792882Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:09.6793195Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:09.6793451Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:09.6793700Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:09.6793983Z [... remainder of the predefined-macro dump from `c++ -dM -E` elided: GCC 11.4.0 targeting x86_64 Linux (__VERSION__ "11.4.0", __GNUC_MINOR__ 4, __x86_64__ 1, __linux__ 1, __ELF__ 1), C++17 feature-test macros (e.g. __cpp_constexpr 201603L, __cpp_deduction_guides 201703L), plus the usual type-width, endianness, and floating-point limit macros ...]
2025-05-07T20:25:09.7393770Z + conda run -n build_binary c++ --version
2025-05-07T20:25:11.6334822Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:11.6335287Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:11.6335748Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:25:11.6336286Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:11.6970656Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:11.6971320Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:13.6598114Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:13.6601298Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:13.6602016Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:15.6325033Z #define __cplusplus 201703L
2025-05-07T20:25:15.6328188Z [INSTALL] Successfully installed C/C++ compilers
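The three compiler probes above are easy to reproduce outside CI. A minimal sketch, assuming only a conda environment named build_binary with the conda-forge compilers installed (as this job sets up):

    #!/usr/bin/env bash
    # Sketch of the compiler introspection steps performed above; assumes a
    # conda environment named "build_binary" providing the cc/c++ drivers.
    set -euo pipefail

    env_name=build_binary

    # Compiler identification banner.
    conda run -n "${env_name}" c++ --version

    # Default C standard: dump the predefined macros for an empty translation
    # unit and read __STDC_VERSION__ (201710L corresponds to C17).
    conda run -n "${env_name}" cc -dM -E - < /dev/null | grep __STDC_VERSION__

    # Default C++ standard: same probe via __cplusplus (201703L is C++17).
    conda run -n "${env_name}" c++ -dM -E -x c++ - < /dev/null | grep __cplusplus

The -dM -E trick asks the preprocessor to dump its predefined macros for an empty input, so the reported standards reflect the compiler's defaults rather than any project-specific flags.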
2025-05-07T20:25:15.6365173Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:15.6365599Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:15.6385437Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:15.6385932Z env:
2025-05-07T20:25:15.6386152Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:15.6386446Z   BUILD_ENV: build_binary
2025-05-07T20:25:15.6386674Z   BUILD_TARGET: genai
2025-05-07T20:25:15.6386896Z   BUILD_VARIANT: cuda
2025-05-07T20:25:15.6387332Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:15.6387571Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:15.6387869Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:15.6388199Z ##[endgroup]
2025-05-07T20:25:15.9797496Z ################################################################################
2025-05-07T20:25:15.9797836Z # Install CUDA
2025-05-07T20:25:15.9798042Z #
2025-05-07T20:25:15.9812696Z # [2025-05-07T20:25:15.980Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:15.9813184Z ################################################################################
2025-05-07T20:25:15.9827820Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:16.0709025Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:16.0709490Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:16.0713676Z + conda clean --packages --tarball -y
2025-05-07T20:25:16.7893092Z Will remove 32 (148.9 MB) tarball(s).
2025-05-07T20:25:16.7893595Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:16.8559855Z + conda clean --all -y
2025-05-07T20:25:17.5282945Z There are no unused tarball(s) to remove.
2025-05-07T20:25:17.5283336Z Will remove 1 index cache(s).
2025-05-07T20:25:17.5283845Z There are no unused package(s) to remove.
2025-05-07T20:25:17.5284198Z There are no tempfile(s) to remove.
2025-05-07T20:25:17.5284515Z There are no logfile(s) to remove.
2025-05-07T20:25:17.5971478Z [INSTALL] Installing CUDA 12.6.3 ...
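Judging from the [EXEC]/[ATTEMPT] lines that follow, the install_cuda helper cleans the conda caches and then retries a pinned conda-forge install. The real implementation lives in .github/scripts/setup_env.bash and is not shown in this log, so the following is only a hypothetical sketch reconstructed from the log lines (the retry count and back-off are assumptions):

    #!/usr/bin/env bash
    # Hypothetical sketch of an install_cuda-style helper; reconstructed from
    # the log, not the actual .github/scripts/setup_env.bash implementation.
    set -euo pipefail

    env_name="${1:-build_binary}"     # conda environment name
    cuda_version="${2:-12.6.3}"       # CUDA version to pin

    # Free disk space before pulling ~1.6 GB of CUDA packages.
    conda clean --packages --tarball -y
    conda clean --all -y

    # Retry the install, mirroring the [ATTEMPT n/3] numbering in the log.
    ok=0
    for attempt in 0 1 2 3; do
      echo "[EXEC] [ATTEMPT ${attempt}/3] + conda install cuda=${cuda_version}"
      if conda install --force-reinstall -n "${env_name}" \
           -c conda-forge --override-channels -y "cuda=${cuda_version}"; then
        ok=1
        break
      fi
      sleep 30  # assumed back-off between attempts
    done
    [[ "${ok}" -eq 1 ]] || { echo "[ERROR] CUDA ${cuda_version} install failed"; exit 1; }

Pinning cuda=12.6.3 with --override-channels keeps the solver from mixing pkgs/main builds into the toolchain, which matches the package plan below, where python, sqlite, and tk are superseded by conda-forge builds.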
2025-05-07T20:25:17.5994646Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:18.5074401Z Channels:
2025-05-07T20:25:18.5074676Z  - conda-forge
2025-05-07T20:25:18.5074896Z Platform: linux-64
2025-05-07T20:25:29.2683460Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:30.3774494Z Solving environment: done
2025-05-07T20:25:30.4510943Z ## Package Plan ##
2025-05-07T20:25:30.4511313Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:30.4511715Z   added / updated specs:
2025-05-07T20:25:30.4511960Z     - cuda=12.6.3
2025-05-07T20:25:30.4512258Z The following packages will be downloaded:
2025-05-07T20:25:30.4512750Z [... per-package download-size table elided (~120 conda-forge packages; the largest are nsight-compute 443.1 MB, libcublas 256.2 MB, libcufft 156.2 MB, libcusparse 118.6 MB, cuda-nsight 113.2 MB, and cuda-nvvp 109.3 MB) ...]
2025-05-07T20:25:30.4581164Z                                            Total:        1.64 GB
2025-05-07T20:25:30.4581508Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:30.4581948Z   alsa-lib           conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0
2025-05-07T20:25:30.4582374Z   attr               conda-forge/linux-64::attr-2.5.1-h166bdaf_1
2025-05-07T20:25:30.4582800Z   binutils           conda-forge/linux-64::binutils-2.40-h4852527_7
2025-05-07T20:25:30.4583255Z   c-compiler         conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0
2025-05-07T20:25:30.4583698Z   cuda               conda-forge/noarch::cuda-12.6.3-ha804496_0
2025-05-07T20:25:30.4584174Z   cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0
2025-05-07T20:25:30.4584770Z   cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0
2025-05-07T20:25:30.4585345Z   cuda-compiler      conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0
2025-05-07T20:25:30.4585898Z   cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0
2025-05-07T20:25:30.4586458Z   cuda-crt-tools     conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0
2025-05-07T20:25:30.4586976Z   cuda-cudart        conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0
2025-05-07T20:25:30.4587498Z   cuda-cudart-dev    conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0
2025-05-07T20:25:30.4588164Z   cuda-cudart-dev_l~
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4588772Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:30.4592015Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4592627Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4593191Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4593809Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:30.4594318Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:30.4594839Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4595379Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:30.4595962Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4596484Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:30.4596975Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:30.4597540Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:30.4598087Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:30.4598561Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:30.4599094Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:30.4599655Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:30.4600205Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:30.4600763Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:30.4601307Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4601831Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4602344Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:30.4602846Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4603356Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:30.4603974Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:30.4604470Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:30.4604991Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:30.4605555Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:30.4606107Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:30.4606621Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:30.4607096Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:30.4607622Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:30.4608197Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:30.4608743Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:30.4609294Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4609847Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:30.4610333Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:30.4610803Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:30.4611441Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:30.4611992Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:30.4612445Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:30.4612843Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:30.4613360Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:30.4613970Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:30.4614672Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:30.4615242Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:30.4615746Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:30.4616243Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:30.4616739Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:30.4617202Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:30.4617626Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:30.4618051Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:30.4618475Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:30.4618848Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:30.4619268Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:30.4619685Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:30.4620086Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:30.4620531Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:30.4621048Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:30.4621545Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:30.4622025Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:30.4622523Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:30.4623032Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:30.4623537Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:30.4624049Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:30.4624568Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:30.4625105Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:30.4625647Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:30.4626181Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:30.4626704Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:30.4627165Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:30.4627641Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:30.4628139Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:30.4628656Z libgcrypt-lib 
conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:30.4629152Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:30.4629611Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:30.4630083Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:30.4630517Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:30.4631040Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:30.4631505Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:30.4631966Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:30.4632394Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:30.4632864Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:30.4633389Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:30.4634199Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:30.4634807Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:30.4635343Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:30.4635853Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:30.4636353Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:30.4636807Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:30.4637282Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:30.4637740Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:30.4638182Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:30.4638954Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:30.4639458Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:30.4639924Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:30.4640360Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:30.4640786Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:30.4641279Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:30.4641771Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:30.4642168Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:30.4642578Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:30.4643073Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:30.4643644Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:30.4644153Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:30.4644654Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:30.4645096Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:30.4645542Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:30.4646042Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:30.4646572Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:30.4647115Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:30.4647697Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:30.4648257Z xcb-util-wm 
conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:25:30.4648770Z   xkeyboard-config   conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:25:30.4649304Z   xorg-libice        conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:25:30.4649789Z   xorg-libsm         conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:25:30.4650274Z   xorg-libx11        conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:25:30.4650753Z   xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:25:30.4651498Z   xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:25:30.4652090Z   xorg-libxdamage    conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:25:30.4652621Z   xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:25:30.4662283Z   xorg-libxext       conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:25:30.4662810Z   xorg-libxfixes     conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:25:30.4663317Z   xorg-libxi         conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:25:30.4664135Z   xorg-libxrandr     conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:25:30.4664806Z   xorg-libxrender    conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:25:30.4665337Z   xorg-libxtst       conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:25:30.4665790Z   zstd               conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:25:30.4666158Z The following packages will be UPDATED:
2025-05-07T20:25:30.4666642Z   libuuid  pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:30.4667234Z   zlib     pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:30.4667785Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:30.4668388Z   python   pkgs/main::python-3.11.11-he870216_0 --> conda-forge::python-3.11.8-hab00c5b_0_cpython
2025-05-07T20:25:30.4669024Z   sqlite   pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:30.4669603Z   tk       pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:25:30.4670113Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:30.4670495Z [... interleaved progress-bar redraws elided; the archives download in parallel, led by nsight-compute (443.1 MB), libcublas (256.2 MB), and libcufft (156.2 MB) ...]
2025-05-07T20:25:33.2479015Z nsight-compute-2024.
| 443.1 MB | ## | 21% 2025-05-07T20:25:33.2479556Z 2025-05-07T20:25:33.2479560Z 2025-05-07T20:25:33.2480233Z 2025-05-07T20:25:33.2681406Z libcusparse-12.5.4.2 | 118.6 MB | ########4 | 85%  2025-05-07T20:25:33.2681680Z 2025-05-07T20:25:33.2681684Z 2025-05-07T20:25:33.2681688Z 2025-05-07T20:25:33.2684224Z 2025-05-07T20:25:33.2809890Z cuda-nsight-12.6.77 | 113.2 MB | #########1 | 92%  2025-05-07T20:25:33.2810166Z 2025-05-07T20:25:33.3390469Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 38%  2025-05-07T20:25:33.3411409Z nsight-compute-2024. | 443.1 MB | ##1 | 21% 2025-05-07T20:25:33.3411650Z 2025-05-07T20:25:33.3413714Z 2025-05-07T20:25:33.3611029Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 72%  2025-05-07T20:25:33.3611284Z 2025-05-07T20:25:33.3611525Z 2025-05-07T20:25:33.3611540Z 2025-05-07T20:25:33.3811106Z libcusparse-12.5.4.2 | 118.6 MB | ########7 | 88%  2025-05-07T20:25:33.3811415Z 2025-05-07T20:25:33.3811421Z 2025-05-07T20:25:33.3811462Z 2025-05-07T20:25:33.3814488Z 2025-05-07T20:25:33.3843501Z cuda-nsight-12.6.77 | 113.2 MB | #########5 | 95%  2025-05-07T20:25:33.3843920Z 2025-05-07T20:25:33.4417036Z libcublas-12.6.4.1 | 256.2 MB | ###9 | 40%  2025-05-07T20:25:33.4538117Z nsight-compute-2024. | 443.1 MB | ##2 | 22% 2025-05-07T20:25:33.4538518Z 2025-05-07T20:25:33.4541019Z 2025-05-07T20:25:33.4705705Z libcufft-11.3.0.4 | 156.2 MB | #######3 | 74%  2025-05-07T20:25:33.4705968Z 2025-05-07T20:25:33.4705979Z 2025-05-07T20:25:33.4705984Z 2025-05-07T20:25:33.4845979Z libcusparse-12.5.4.2 | 118.6 MB | ######### | 90%  2025-05-07T20:25:33.4846252Z 2025-05-07T20:25:33.4891822Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 41%  2025-05-07T20:25:33.4892074Z 2025-05-07T20:25:33.4892078Z 2025-05-07T20:25:33.4892082Z 2025-05-07T20:25:33.4893586Z 2025-05-07T20:25:33.5462443Z cuda-nsight-12.6.77 | 113.2 MB | #########8 | 98%  2025-05-07T20:25:33.5586080Z nsight-compute-2024. | 443.1 MB | ##2 | 23% 2025-05-07T20:25:33.5586321Z 2025-05-07T20:25:33.5586331Z 2025-05-07T20:25:33.5838316Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 76%  2025-05-07T20:25:33.5838790Z 2025-05-07T20:25:33.5838794Z 2025-05-07T20:25:33.5839918Z 2025-05-07T20:25:33.5852968Z libcusparse-12.5.4.2 | 118.6 MB | #########3 | 93%  2025-05-07T20:25:33.5853245Z 2025-05-07T20:25:33.6466702Z libcublas-12.6.4.1 | 256.2 MB | ####2 | 42%  2025-05-07T20:25:33.6601040Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:33.6601292Z 2025-05-07T20:25:33.6602699Z 2025-05-07T20:25:33.6845482Z libcufft-11.3.0.4 | 156.2 MB | #######7 | 78%  2025-05-07T20:25:33.6845742Z 2025-05-07T20:25:33.6845746Z 2025-05-07T20:25:33.6845750Z 2025-05-07T20:25:33.6856312Z libcusparse-12.5.4.2 | 118.6 MB | #########5 | 96%  2025-05-07T20:25:33.6856580Z 2025-05-07T20:25:33.7551647Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 44%  2025-05-07T20:25:33.7604099Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:33.7604354Z 2025-05-07T20:25:33.7606102Z 2025-05-07T20:25:33.7884103Z libcufft-11.3.0.4 | 156.2 MB | ######## | 80%  2025-05-07T20:25:33.7884366Z 2025-05-07T20:25:33.7884370Z 2025-05-07T20:25:33.7888142Z 2025-05-07T20:25:33.8000455Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:33.8004856Z 2025-05-07T20:25:33.8605613Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 45%  2025-05-07T20:25:33.8605868Z 2025-05-07T20:25:33.8607459Z 2025-05-07T20:25:33.8610273Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 82%  2025-05-07T20:25:33.9087526Z nsight-compute-2024. 
| 443.1 MB | ##5 | 25% 2025-05-07T20:25:33.9087778Z 2025-05-07T20:25:33.9606089Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 47%  2025-05-07T20:25:33.9606390Z 2025-05-07T20:25:33.9609611Z 2025-05-07T20:25:33.9623074Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 84%  2025-05-07T20:25:34.0106705Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:34.0108920Z 2025-05-07T20:25:34.0607696Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 48%  2025-05-07T20:25:34.0607951Z 2025-05-07T20:25:34.0610933Z 2025-05-07T20:25:34.0624770Z libcufft-11.3.0.4 | 156.2 MB | ########6 | 87%  2025-05-07T20:25:34.1107044Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:34.1108090Z 2025-05-07T20:25:34.1613974Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:25:34.1614242Z 2025-05-07T20:25:34.1614883Z 2025-05-07T20:25:34.1627122Z libcufft-11.3.0.4 | 156.2 MB | ########9 | 89%  2025-05-07T20:25:34.2107187Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:25:34.2108160Z 2025-05-07T20:25:34.2614594Z libcublas-12.6.4.1 | 256.2 MB | ##### | 51%  2025-05-07T20:25:34.2614846Z 2025-05-07T20:25:34.2616144Z 2025-05-07T20:25:34.2629805Z libcufft-11.3.0.4 | 156.2 MB | #########1 | 92%  2025-05-07T20:25:34.3108292Z nsight-compute-2024. | 443.1 MB | ##8 | 28% 2025-05-07T20:25:34.3109727Z 2025-05-07T20:25:34.3630366Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 52%  2025-05-07T20:25:34.3630627Z 2025-05-07T20:25:34.3631345Z 2025-05-07T20:25:34.3702381Z libcufft-11.3.0.4 | 156.2 MB | #########4 | 94%  2025-05-07T20:25:34.4185569Z nsight-compute-2024. | 443.1 MB | ##9 | 29% 2025-05-07T20:25:34.4187312Z 2025-05-07T20:25:34.4660002Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 54%  2025-05-07T20:25:34.4660254Z 2025-05-07T20:25:34.4660344Z 2025-05-07T20:25:34.4739829Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 97%  2025-05-07T20:25:34.5208179Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:34.5208635Z 2025-05-07T20:25:34.5686313Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 55%  2025-05-07T20:25:34.5686574Z 2025-05-07T20:25:34.5687102Z 2025-05-07T20:25:34.5746777Z libcufft-11.3.0.4 | 156.2 MB | #########8 | 99%  2025-05-07T20:25:34.6208344Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:25:34.6210114Z 2025-05-07T20:25:34.6754727Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 57%  2025-05-07T20:25:34.7208642Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:25:34.7209039Z 2025-05-07T20:25:34.7754579Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 59%  2025-05-07T20:25:34.8209182Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:34.8209542Z 2025-05-07T20:25:34.9149870Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 61%  2025-05-07T20:25:34.9210397Z nsight-compute-2024. | 443.1 MB | ###3 | 33% 2025-05-07T20:25:34.9212123Z 2025-05-07T20:25:35.0153194Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 64%  2025-05-07T20:25:35.0223920Z nsight-compute-2024. | 443.1 MB | ###4 | 34% 2025-05-07T20:25:35.0225331Z 2025-05-07T20:25:35.1224106Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:25:35.1225122Z 2025-05-07T20:25:35.1464094Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:35.2326684Z nsight-compute-2024. | 443.1 MB | ###5 | 35% 2025-05-07T20:25:35.2326990Z 2025-05-07T20:25:35.2464854Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 71%  2025-05-07T20:25:35.3399634Z nsight-compute-2024. | 443.1 MB | ###6 | 36% 2025-05-07T20:25:35.3399943Z 2025-05-07T20:25:35.3465971Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:35.4401298Z nsight-compute-2024. 
| 443.1 MB | ###7 | 37% 2025-05-07T20:25:35.4402211Z 2025-05-07T20:25:35.4467080Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 76%  2025-05-07T20:25:35.5467166Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:35.5613440Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:35.5614284Z 2025-05-07T20:25:35.6473004Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:25:35.6613684Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:35.6614461Z 2025-05-07T20:25:35.7440625Z libcublas-12.6.4.1 | 256.2 MB | ######## | 80%  2025-05-07T20:25:35.7440959Z 2025-05-07T20:25:35.7440963Z 2025-05-07T20:25:35.7440967Z 2025-05-07T20:25:35.7440971Z 2025-05-07T20:25:35.7614397Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:35.7615192Z 2025-05-07T20:25:35.7685254Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:35.7839966Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:35.7840313Z 2025-05-07T20:25:35.7840317Z 2025-05-07T20:25:35.7840321Z 2025-05-07T20:25:35.7840324Z 2025-05-07T20:25:35.7841891Z 2025-05-07T20:25:35.8827066Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:35.8839783Z nsight-compute-2024. | 443.1 MB | ####2 | 42% 2025-05-07T20:25:35.8840156Z 2025-05-07T20:25:35.8840164Z 2025-05-07T20:25:35.8840196Z 2025-05-07T20:25:35.8840201Z 2025-05-07T20:25:35.8840211Z 2025-05-07T20:25:35.9120779Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 4%  2025-05-07T20:25:35.9121156Z 2025-05-07T20:25:35.9844639Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:25:35.9844915Z 2025-05-07T20:25:35.9844919Z 2025-05-07T20:25:35.9844923Z 2025-05-07T20:25:35.9844927Z 2025-05-07T20:25:35.9847342Z 2025-05-07T20:25:35.9870365Z cuda-nvvp-12.6.80 | 109.3 MB | 7 | 7%  2025-05-07T20:25:36.0601236Z nsight-compute-2024. | 443.1 MB | ####3 | 43% 2025-05-07T20:25:36.0601569Z 2025-05-07T20:25:36.0845320Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:36.0845580Z 2025-05-07T20:25:36.0845597Z 2025-05-07T20:25:36.0845601Z 2025-05-07T20:25:36.0845605Z 2025-05-07T20:25:36.0847408Z 2025-05-07T20:25:36.0908128Z cuda-nvvp-12.6.80 | 109.3 MB | # | 10%  2025-05-07T20:25:36.1671713Z nsight-compute-2024. | 443.1 MB | ####4 | 44% 2025-05-07T20:25:36.1672267Z 2025-05-07T20:25:36.1672275Z 2025-05-07T20:25:36.1672283Z 2025-05-07T20:25:36.1846883Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:36.1847244Z 2025-05-07T20:25:36.1847248Z 2025-05-07T20:25:36.1847252Z 2025-05-07T20:25:36.1847255Z 2025-05-07T20:25:36.1847259Z 2025-05-07T20:25:36.1949923Z cuda-nvvp-12.6.80 | 109.3 MB | #3 | 13%  2025-05-07T20:25:36.1950281Z 2025-05-07T20:25:36.2019195Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 89%  2025-05-07T20:25:36.2181264Z nsight-compute-2024. | 443.1 MB | ####5 | 45% 2025-05-07T20:25:36.2181576Z 2025-05-07T20:25:36.2181580Z 2025-05-07T20:25:36.2181584Z 2025-05-07T20:25:36.2181587Z 2025-05-07T20:25:36.2181591Z 2025-05-07T20:25:36.2189983Z 2025-05-07T20:25:36.3062574Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:36.3062883Z 2025-05-07T20:25:36.3062888Z 2025-05-07T20:25:36.3062904Z 2025-05-07T20:25:36.3062908Z 2025-05-07T20:25:36.3070336Z 2025-05-07T20:25:36.3187951Z cuda-nvvp-12.6.80 | 109.3 MB | #6 | 17%  2025-05-07T20:25:36.3188244Z 2025-05-07T20:25:36.3188248Z 2025-05-07T20:25:36.3188252Z 2025-05-07T20:25:36.3188256Z 2025-05-07T20:25:36.3188259Z 2025-05-07T20:25:36.3192794Z 2025-05-07T20:25:36.3403555Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:36.3503318Z nsight-compute-2024. 
| 443.1 MB | ####6 | 46% 2025-05-07T20:25:36.3503594Z 2025-05-07T20:25:36.4188297Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:36.4188578Z 2025-05-07T20:25:36.4188583Z 2025-05-07T20:25:36.4188586Z 2025-05-07T20:25:36.4188590Z 2025-05-07T20:25:36.4188594Z 2025-05-07T20:25:36.4190393Z 2025-05-07T20:25:36.4226105Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 5%  2025-05-07T20:25:36.4226407Z 2025-05-07T20:25:36.4226410Z 2025-05-07T20:25:36.4226651Z 2025-05-07T20:25:36.4226657Z 2025-05-07T20:25:36.4228407Z 2025-05-07T20:25:36.4796366Z cuda-nvvp-12.6.80 | 109.3 MB | #9 | 19%  2025-05-07T20:25:36.5136902Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:25:36.5137243Z 2025-05-07T20:25:36.5188180Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 92%  2025-05-07T20:25:36.5188510Z 2025-05-07T20:25:36.5188516Z 2025-05-07T20:25:36.5188521Z 2025-05-07T20:25:36.5188527Z 2025-05-07T20:25:36.5188551Z 2025-05-07T20:25:36.5188556Z 2025-05-07T20:25:36.5470701Z libcusolver-11.7.1.2 | 95.8 MB | 7 | 7%  2025-05-07T20:25:36.5470999Z 2025-05-07T20:25:36.5471003Z 2025-05-07T20:25:36.5471007Z 2025-05-07T20:25:36.5471011Z 2025-05-07T20:25:36.5477206Z 2025-05-07T20:25:36.6044439Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 22%  2025-05-07T20:25:36.6198219Z nsight-compute-2024. | 443.1 MB | ####7 | 48% 2025-05-07T20:25:36.6198580Z 2025-05-07T20:25:36.6198607Z 2025-05-07T20:25:36.6198611Z 2025-05-07T20:25:36.6198615Z 2025-05-07T20:25:36.6198619Z 2025-05-07T20:25:36.6200619Z 2025-05-07T20:25:36.6475217Z libcusolver-11.7.1.2 | 95.8 MB | 9 | 10%  2025-05-07T20:25:36.6475513Z 2025-05-07T20:25:36.6475517Z 2025-05-07T20:25:36.6475521Z 2025-05-07T20:25:36.6475525Z 2025-05-07T20:25:36.6475529Z 2025-05-07T20:25:36.6579236Z cuda-nvvp-12.6.80 | 109.3 MB | ##4 | 25%  2025-05-07T20:25:36.6579558Z 2025-05-07T20:25:36.7140892Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:36.7200446Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:25:36.7200747Z 2025-05-07T20:25:36.7200752Z 2025-05-07T20:25:36.7200755Z 2025-05-07T20:25:36.7200759Z 2025-05-07T20:25:36.7200763Z 2025-05-07T20:25:36.7202267Z 2025-05-07T20:25:36.7478826Z libcusolver-11.7.1.2 | 95.8 MB | #2 | 12%  2025-05-07T20:25:36.7479123Z 2025-05-07T20:25:36.7479128Z 2025-05-07T20:25:36.7479149Z 2025-05-07T20:25:36.7479153Z 2025-05-07T20:25:36.7479262Z 2025-05-07T20:25:36.7931006Z cuda-nvvp-12.6.80 | 109.3 MB | ##7 | 27%  2025-05-07T20:25:36.7931366Z 2025-05-07T20:25:36.8140738Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:36.8206432Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:25:36.8206715Z 2025-05-07T20:25:36.8206719Z 2025-05-07T20:25:36.8206723Z 2025-05-07T20:25:36.8206744Z 2025-05-07T20:25:36.8206748Z 2025-05-07T20:25:36.8209215Z 2025-05-07T20:25:36.8481429Z libcusolver-11.7.1.2 | 95.8 MB | #4 | 15%  2025-05-07T20:25:36.8481724Z 2025-05-07T20:25:36.8481728Z 2025-05-07T20:25:36.8481732Z 2025-05-07T20:25:36.8481737Z 2025-05-07T20:25:36.8483983Z 2025-05-07T20:25:36.9185554Z cuda-nvvp-12.6.80 | 109.3 MB | ### | 30%  2025-05-07T20:25:36.9211303Z nsight-compute-2024. 
| 443.1 MB | ####9 | 50% 2025-05-07T20:25:36.9211590Z 2025-05-07T20:25:36.9211613Z 2025-05-07T20:25:36.9211617Z 2025-05-07T20:25:36.9211621Z 2025-05-07T20:25:36.9211625Z 2025-05-07T20:25:36.9213776Z 2025-05-07T20:25:36.9227446Z libcusolver-11.7.1.2 | 95.8 MB | #7 | 17%  2025-05-07T20:25:36.9227779Z 2025-05-07T20:25:36.9482467Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:36.9482811Z 2025-05-07T20:25:36.9482817Z 2025-05-07T20:25:36.9482822Z 2025-05-07T20:25:36.9482827Z 2025-05-07T20:25:36.9485741Z 2025-05-07T20:25:37.0186553Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 33%  2025-05-07T20:25:37.0212413Z nsight-compute-2024. | 443.1 MB | ##### | 50% 2025-05-07T20:25:37.0212749Z 2025-05-07T20:25:37.0212753Z 2025-05-07T20:25:37.0212757Z 2025-05-07T20:25:37.0212761Z 2025-05-07T20:25:37.0212765Z 2025-05-07T20:25:37.0216079Z 2025-05-07T20:25:37.0408562Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:37.0413091Z 2025-05-07T20:25:37.0524843Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:37.0525134Z 2025-05-07T20:25:37.0525138Z 2025-05-07T20:25:37.0525141Z 2025-05-07T20:25:37.0525145Z 2025-05-07T20:25:37.0526660Z 2025-05-07T20:25:37.1213760Z cuda-nvvp-12.6.80 | 109.3 MB | ###5 | 36%  2025-05-07T20:25:37.1214125Z 2025-05-07T20:25:37.1214129Z 2025-05-07T20:25:37.1214133Z 2025-05-07T20:25:37.1214137Z 2025-05-07T20:25:37.1214141Z 2025-05-07T20:25:37.1214161Z 2025-05-07T20:25:37.1341434Z libcusolver-11.7.1.2 | 95.8 MB | ##2 | 22%  2025-05-07T20:25:37.1408474Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:37.1408827Z 2025-05-07T20:25:37.1526758Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:25:37.1527050Z 2025-05-07T20:25:37.1527054Z 2025-05-07T20:25:37.1527058Z 2025-05-07T20:25:37.1527061Z 2025-05-07T20:25:37.1528979Z 2025-05-07T20:25:37.2216761Z cuda-nvvp-12.6.80 | 109.3 MB | ###8 | 39%  2025-05-07T20:25:37.2217068Z 2025-05-07T20:25:37.2217072Z 2025-05-07T20:25:37.2217076Z 2025-05-07T20:25:37.2217079Z 2025-05-07T20:25:37.2217083Z 2025-05-07T20:25:37.2221097Z 2025-05-07T20:25:37.2343115Z libcusolver-11.7.1.2 | 95.8 MB | ##4 | 25%  2025-05-07T20:25:37.2521586Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:25:37.2524078Z 2025-05-07T20:25:37.2532213Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:25:37.2532470Z 2025-05-07T20:25:37.2532474Z 2025-05-07T20:25:37.2532478Z 2025-05-07T20:25:37.2532482Z 2025-05-07T20:25:37.2534063Z 2025-05-07T20:25:37.3222153Z cuda-nvvp-12.6.80 | 109.3 MB | ####1 | 41%  2025-05-07T20:25:37.3222443Z 2025-05-07T20:25:37.3222447Z 2025-05-07T20:25:37.3222451Z 2025-05-07T20:25:37.3222455Z 2025-05-07T20:25:37.3222459Z 2025-05-07T20:25:37.3227129Z 2025-05-07T20:25:37.3390054Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 27%  2025-05-07T20:25:37.3552662Z nsight-compute-2024. | 443.1 MB | #####2 | 53% 2025-05-07T20:25:37.3553017Z 2025-05-07T20:25:37.3553028Z 2025-05-07T20:25:37.3553172Z 2025-05-07T20:25:37.3553179Z 2025-05-07T20:25:37.3553189Z 2025-05-07T20:25:37.3671321Z cuda-nvvp-12.6.80 | 109.3 MB | ####4 | 44%  2025-05-07T20:25:37.3671699Z 2025-05-07T20:25:37.4225622Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:37.4225992Z 2025-05-07T20:25:37.4226019Z 2025-05-07T20:25:37.4226030Z 2025-05-07T20:25:37.4226045Z 2025-05-07T20:25:37.4226052Z 2025-05-07T20:25:37.4227408Z 2025-05-07T20:25:37.4392878Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 30%  2025-05-07T20:25:37.4553077Z nsight-compute-2024. 
| 443.1 MB | #####3 | 53% 2025-05-07T20:25:37.4553412Z 2025-05-07T20:25:37.4553419Z 2025-05-07T20:25:37.4553425Z 2025-05-07T20:25:37.4553431Z 2025-05-07T20:25:37.4553457Z 2025-05-07T20:25:37.5232761Z cuda-nvvp-12.6.80 | 109.3 MB | ####7 | 47%  2025-05-07T20:25:37.5233174Z 2025-05-07T20:25:37.5233180Z 2025-05-07T20:25:37.5233185Z 2025-05-07T20:25:37.5233190Z 2025-05-07T20:25:37.5233196Z 2025-05-07T20:25:37.5233201Z 2025-05-07T20:25:37.5395073Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 33%  2025-05-07T20:25:37.5556791Z nsight-compute-2024. | 443.1 MB | #####3 | 54% 2025-05-07T20:25:37.5557149Z 2025-05-07T20:25:37.5557155Z 2025-05-07T20:25:37.5557409Z 2025-05-07T20:25:37.5557416Z 2025-05-07T20:25:37.5557421Z 2025-05-07T20:25:37.6237198Z cuda-nvvp-12.6.80 | 109.3 MB | ##### | 50%  2025-05-07T20:25:37.6237579Z 2025-05-07T20:25:37.6237585Z 2025-05-07T20:25:37.6237591Z 2025-05-07T20:25:37.6237596Z 2025-05-07T20:25:37.6237603Z 2025-05-07T20:25:37.6238961Z 2025-05-07T20:25:37.6395999Z libcusolver-11.7.1.2 | 95.8 MB | ###5 | 36%  2025-05-07T20:25:37.6558327Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:37.6558949Z 2025-05-07T20:25:37.6558955Z 2025-05-07T20:25:37.6558960Z 2025-05-07T20:25:37.6558965Z 2025-05-07T20:25:37.6558975Z 2025-05-07T20:25:37.7240166Z cuda-nvvp-12.6.80 | 109.3 MB | #####3 | 53%  2025-05-07T20:25:37.7240569Z 2025-05-07T20:25:37.7240574Z 2025-05-07T20:25:37.7240580Z 2025-05-07T20:25:37.7240585Z 2025-05-07T20:25:37.7240590Z 2025-05-07T20:25:37.7240595Z 2025-05-07T20:25:37.7396922Z libcusolver-11.7.1.2 | 95.8 MB | ###8 | 39%  2025-05-07T20:25:37.7564083Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:37.7564444Z 2025-05-07T20:25:37.7564537Z 2025-05-07T20:25:37.7564542Z 2025-05-07T20:25:37.7564548Z 2025-05-07T20:25:37.7568386Z 2025-05-07T20:25:37.8245598Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 56%  2025-05-07T20:25:37.8246009Z 2025-05-07T20:25:37.8246016Z 2025-05-07T20:25:37.8246021Z 2025-05-07T20:25:37.8246026Z 2025-05-07T20:25:37.8246031Z 2025-05-07T20:25:37.8246056Z 2025-05-07T20:25:37.8400000Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 42%  2025-05-07T20:25:37.8565678Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:37.8566035Z 2025-05-07T20:25:37.8566040Z 2025-05-07T20:25:37.8566046Z 2025-05-07T20:25:37.8566051Z 2025-05-07T20:25:37.8568429Z 2025-05-07T20:25:37.9247004Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 59%  2025-05-07T20:25:37.9247415Z 2025-05-07T20:25:37.9247436Z 2025-05-07T20:25:37.9247441Z 2025-05-07T20:25:37.9247446Z 2025-05-07T20:25:37.9247452Z 2025-05-07T20:25:37.9247457Z 2025-05-07T20:25:37.9401294Z libcusolver-11.7.1.2 | 95.8 MB | ####4 | 45%  2025-05-07T20:25:37.9612161Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:25:37.9612500Z 2025-05-07T20:25:37.9612506Z 2025-05-07T20:25:37.9612511Z 2025-05-07T20:25:37.9612517Z 2025-05-07T20:25:37.9614760Z 2025-05-07T20:25:38.0258337Z cuda-nvvp-12.6.80 | 109.3 MB | ######2 | 62%  2025-05-07T20:25:38.0258732Z 2025-05-07T20:25:38.0258738Z 2025-05-07T20:25:38.0258743Z 2025-05-07T20:25:38.0258748Z 2025-05-07T20:25:38.0258753Z 2025-05-07T20:25:38.0265887Z 2025-05-07T20:25:38.0420901Z libcusolver-11.7.1.2 | 95.8 MB | ####7 | 48%  2025-05-07T20:25:38.0663478Z nsight-compute-2024. 
| 443.1 MB | #####7 | 58% 2025-05-07T20:25:38.0663825Z 2025-05-07T20:25:38.0663831Z 2025-05-07T20:25:38.0663836Z 2025-05-07T20:25:38.0663855Z 2025-05-07T20:25:38.0668703Z 2025-05-07T20:25:38.1277296Z cuda-nvvp-12.6.80 | 109.3 MB | ######4 | 65%  2025-05-07T20:25:38.1277684Z 2025-05-07T20:25:38.1277690Z 2025-05-07T20:25:38.1277695Z 2025-05-07T20:25:38.1277701Z 2025-05-07T20:25:38.1277706Z 2025-05-07T20:25:38.1277711Z 2025-05-07T20:25:38.1421572Z libcusolver-11.7.1.2 | 95.8 MB | ##### | 51%  2025-05-07T20:25:38.1712915Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:38.1713308Z 2025-05-07T20:25:38.1713313Z 2025-05-07T20:25:38.1713319Z 2025-05-07T20:25:38.1713324Z 2025-05-07T20:25:38.1717102Z 2025-05-07T20:25:38.2319352Z cuda-nvvp-12.6.80 | 109.3 MB | ######7 | 68%  2025-05-07T20:25:38.2328697Z 2025-05-07T20:25:38.2328704Z 2025-05-07T20:25:38.2328710Z 2025-05-07T20:25:38.2328715Z 2025-05-07T20:25:38.2328720Z 2025-05-07T20:25:38.2328725Z 2025-05-07T20:25:38.2430324Z libcusolver-11.7.1.2 | 95.8 MB | #####3 | 54%  2025-05-07T20:25:38.2765025Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:38.2765379Z 2025-05-07T20:25:38.2765385Z 2025-05-07T20:25:38.2765390Z 2025-05-07T20:25:38.2765395Z 2025-05-07T20:25:38.2766801Z 2025-05-07T20:25:38.3336204Z cuda-nvvp-12.6.80 | 109.3 MB | ####### | 71%  2025-05-07T20:25:38.3336595Z 2025-05-07T20:25:38.3336601Z 2025-05-07T20:25:38.3336611Z 2025-05-07T20:25:38.3336616Z 2025-05-07T20:25:38.3336621Z 2025-05-07T20:25:38.3336627Z 2025-05-07T20:25:38.3452001Z libcusolver-11.7.1.2 | 95.8 MB | #####6 | 56%  2025-05-07T20:25:38.3766254Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:38.3766643Z 2025-05-07T20:25:38.3766918Z 2025-05-07T20:25:38.3766954Z 2025-05-07T20:25:38.3766960Z 2025-05-07T20:25:38.3768637Z 2025-05-07T20:25:38.4337786Z cuda-nvvp-12.6.80 | 109.3 MB | #######3 | 74%  2025-05-07T20:25:38.4338183Z 2025-05-07T20:25:38.4338199Z 2025-05-07T20:25:38.4338223Z 2025-05-07T20:25:38.4338228Z 2025-05-07T20:25:38.4338234Z 2025-05-07T20:25:38.4338239Z 2025-05-07T20:25:38.4563161Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 59%  2025-05-07T20:25:38.4768753Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:38.4769112Z 2025-05-07T20:25:38.4769222Z 2025-05-07T20:25:38.4769228Z 2025-05-07T20:25:38.4769231Z 2025-05-07T20:25:38.4770502Z 2025-05-07T20:25:38.5340230Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 77%  2025-05-07T20:25:38.5340649Z 2025-05-07T20:25:38.5340654Z 2025-05-07T20:25:38.5340659Z 2025-05-07T20:25:38.5340665Z 2025-05-07T20:25:38.5340670Z 2025-05-07T20:25:38.5340675Z 2025-05-07T20:25:38.5564740Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 63%  2025-05-07T20:25:38.5783857Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:25:38.5784228Z 2025-05-07T20:25:38.5784234Z 2025-05-07T20:25:38.5784239Z 2025-05-07T20:25:38.5784261Z 2025-05-07T20:25:38.5784267Z 2025-05-07T20:25:38.6438731Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 79%  2025-05-07T20:25:38.6439143Z 2025-05-07T20:25:38.6439148Z 2025-05-07T20:25:38.6439154Z 2025-05-07T20:25:38.6439159Z 2025-05-07T20:25:38.6439165Z 2025-05-07T20:25:38.6442790Z 2025-05-07T20:25:38.6565951Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:38.6802721Z nsight-compute-2024. 
| 443.1 MB | ######2 | 62% 2025-05-07T20:25:38.6803016Z 2025-05-07T20:25:38.6803234Z 2025-05-07T20:25:38.6803239Z 2025-05-07T20:25:38.6803245Z 2025-05-07T20:25:38.6804891Z 2025-05-07T20:25:38.7444189Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:38.7444590Z 2025-05-07T20:25:38.7444596Z 2025-05-07T20:25:38.7444601Z 2025-05-07T20:25:38.7444607Z 2025-05-07T20:25:38.7444612Z 2025-05-07T20:25:38.7444618Z 2025-05-07T20:25:38.7575826Z libcusolver-11.7.1.2 | 95.8 MB | ######8 | 69%  2025-05-07T20:25:38.7805362Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:38.7805670Z 2025-05-07T20:25:38.7805813Z 2025-05-07T20:25:38.7805819Z 2025-05-07T20:25:38.7805824Z 2025-05-07T20:25:38.7807496Z 2025-05-07T20:25:38.8445006Z cuda-nvvp-12.6.80 | 109.3 MB | ########5 | 85%  2025-05-07T20:25:38.8445363Z 2025-05-07T20:25:38.8445383Z 2025-05-07T20:25:38.8445388Z 2025-05-07T20:25:38.8445394Z 2025-05-07T20:25:38.8445400Z 2025-05-07T20:25:38.8447727Z 2025-05-07T20:25:38.8663587Z libcusolver-11.7.1.2 | 95.8 MB | #######2 | 72%  2025-05-07T20:25:38.8806860Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:25:38.8807192Z 2025-05-07T20:25:38.8807198Z 2025-05-07T20:25:38.8807204Z 2025-05-07T20:25:38.8807209Z 2025-05-07T20:25:38.8810348Z 2025-05-07T20:25:38.8924072Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 88%  2025-05-07T20:25:38.8924357Z 2025-05-07T20:25:38.8931834Z 2025-05-07T20:25:38.9322373Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:38.9322890Z 2025-05-07T20:25:38.9322896Z 2025-05-07T20:25:38.9322902Z 2025-05-07T20:25:38.9322905Z 2025-05-07T20:25:38.9322909Z 2025-05-07T20:25:38.9322913Z 2025-05-07T20:25:38.9324153Z 2025-05-07T20:25:38.9488220Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:38.9488570Z 2025-05-07T20:25:38.9488574Z 2025-05-07T20:25:38.9488578Z 2025-05-07T20:25:38.9488582Z 2025-05-07T20:25:38.9488824Z 2025-05-07T20:25:38.9488837Z 2025-05-07T20:25:38.9935601Z libcusolver-11.7.1.2 | 95.8 MB | #######5 | 75%  2025-05-07T20:25:38.9982717Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:38.9982986Z 2025-05-07T20:25:38.9982990Z 2025-05-07T20:25:38.9982994Z 2025-05-07T20:25:38.9982997Z 2025-05-07T20:25:38.9983001Z 2025-05-07T20:25:39.0322954Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 91%  2025-05-07T20:25:39.0323384Z 2025-05-07T20:25:39.0323391Z 2025-05-07T20:25:39.0323396Z 2025-05-07T20:25:39.0323402Z 2025-05-07T20:25:39.0323407Z 2025-05-07T20:25:39.0323412Z 2025-05-07T20:25:39.0325408Z 2025-05-07T20:25:39.0577318Z libnpp-12.3.1.54 | 93.4 MB | 2 | 3%  2025-05-07T20:25:39.0577622Z 2025-05-07T20:25:39.0577633Z 2025-05-07T20:25:39.0577638Z 2025-05-07T20:25:39.0577643Z 2025-05-07T20:25:39.0577648Z 2025-05-07T20:25:39.0581926Z 2025-05-07T20:25:39.1070939Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 78%  2025-05-07T20:25:39.1071415Z 2025-05-07T20:25:39.1071421Z 2025-05-07T20:25:39.1071428Z 2025-05-07T20:25:39.1071433Z 2025-05-07T20:25:39.1075275Z 2025-05-07T20:25:39.1160657Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 94%  2025-05-07T20:25:39.1326560Z nsight-compute-2024. 
| 443.1 MB | ######5 | 65% 2025-05-07T20:25:39.1326861Z 2025-05-07T20:25:39.1327052Z 2025-05-07T20:25:39.1327058Z 2025-05-07T20:25:39.1327062Z 2025-05-07T20:25:39.1327082Z 2025-05-07T20:25:39.1327086Z 2025-05-07T20:25:39.1328340Z 2025-05-07T20:25:39.1752216Z libnpp-12.3.1.54 | 93.4 MB | 5 | 5%  2025-05-07T20:25:39.1752524Z 2025-05-07T20:25:39.1752530Z 2025-05-07T20:25:39.1752534Z 2025-05-07T20:25:39.1752537Z 2025-05-07T20:25:39.1752541Z 2025-05-07T20:25:39.1757173Z 2025-05-07T20:25:39.2165802Z libcusolver-11.7.1.2 | 95.8 MB | ########1 | 81%  2025-05-07T20:25:39.2166239Z 2025-05-07T20:25:39.2166275Z 2025-05-07T20:25:39.2166281Z 2025-05-07T20:25:39.2166286Z 2025-05-07T20:25:39.2169480Z 2025-05-07T20:25:39.2298562Z cuda-nvvp-12.6.80 | 109.3 MB | #########6 | 97%  2025-05-07T20:25:39.2327553Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:25:39.2327924Z 2025-05-07T20:25:39.2327938Z 2025-05-07T20:25:39.2327942Z 2025-05-07T20:25:39.2327946Z 2025-05-07T20:25:39.2327949Z 2025-05-07T20:25:39.2327953Z 2025-05-07T20:25:39.2327957Z 2025-05-07T20:25:39.2794133Z libnpp-12.3.1.54 | 93.4 MB | 8 | 8%  2025-05-07T20:25:39.2794492Z 2025-05-07T20:25:39.2794502Z 2025-05-07T20:25:39.2794506Z 2025-05-07T20:25:39.2794510Z 2025-05-07T20:25:39.2794514Z 2025-05-07T20:25:39.2796464Z 2025-05-07T20:25:39.3249211Z libcusolver-11.7.1.2 | 95.8 MB | ########4 | 84%  2025-05-07T20:25:39.3249614Z 2025-05-07T20:25:39.3249619Z 2025-05-07T20:25:39.3249622Z 2025-05-07T20:25:39.3249626Z 2025-05-07T20:25:39.3254188Z 2025-05-07T20:25:39.3334221Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 99%  2025-05-07T20:25:39.3334618Z 2025-05-07T20:25:39.3334624Z 2025-05-07T20:25:39.3334630Z 2025-05-07T20:25:39.3334635Z 2025-05-07T20:25:39.3334640Z 2025-05-07T20:25:39.3334645Z 2025-05-07T20:25:39.3336293Z 2025-05-07T20:25:39.3394313Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:39.3802049Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:39.3803290Z 2025-05-07T20:25:39.3803299Z 2025-05-07T20:25:39.3803304Z 2025-05-07T20:25:39.3803310Z 2025-05-07T20:25:39.3803315Z 2025-05-07T20:25:39.3803324Z 2025-05-07T20:25:39.4332127Z libcusolver-11.7.1.2 | 95.8 MB | ########7 | 87%  2025-05-07T20:25:39.4332534Z 2025-05-07T20:25:39.4332539Z 2025-05-07T20:25:39.4332544Z 2025-05-07T20:25:39.4334220Z 2025-05-07T20:25:39.4335143Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:39.4335424Z 2025-05-07T20:25:39.4335662Z 2025-05-07T20:25:39.4335671Z 2025-05-07T20:25:39.4335675Z 2025-05-07T20:25:39.4335679Z 2025-05-07T20:25:39.4335682Z 2025-05-07T20:25:39.4343095Z 2025-05-07T20:25:39.4449959Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:39.4802911Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:39.4803198Z 2025-05-07T20:25:39.4803202Z 2025-05-07T20:25:39.4803205Z 2025-05-07T20:25:39.4803209Z 2025-05-07T20:25:39.4803226Z 2025-05-07T20:25:39.4805962Z 2025-05-07T20:25:39.5339394Z libcusolver-11.7.1.2 | 95.8 MB | ######### | 91%  2025-05-07T20:25:39.5339809Z 2025-05-07T20:25:39.5339814Z 2025-05-07T20:25:39.5339818Z 2025-05-07T20:25:39.5339822Z 2025-05-07T20:25:39.5339826Z 2025-05-07T20:25:39.5339829Z 2025-05-07T20:25:39.5339957Z 2025-05-07T20:25:39.5452727Z libnpp-12.3.1.54 | 93.4 MB | #7 | 17%  2025-05-07T20:25:39.5802903Z nsight-compute-2024. 
| 443.1 MB | ######8 | 68% 2025-05-07T20:25:39.5803180Z 2025-05-07T20:25:39.5803184Z 2025-05-07T20:25:39.5803188Z 2025-05-07T20:25:39.5803191Z 2025-05-07T20:25:39.5803195Z 2025-05-07T20:25:39.5808956Z 2025-05-07T20:25:39.6343008Z libcusolver-11.7.1.2 | 95.8 MB | #########4 | 94%  2025-05-07T20:25:39.6343308Z 2025-05-07T20:25:39.6343312Z 2025-05-07T20:25:39.6343316Z 2025-05-07T20:25:39.6343320Z 2025-05-07T20:25:39.6343324Z 2025-05-07T20:25:39.6343328Z 2025-05-07T20:25:39.6344822Z 2025-05-07T20:25:39.6454005Z libnpp-12.3.1.54 | 93.4 MB | ## | 21%  2025-05-07T20:25:39.6907269Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:25:39.6907524Z 2025-05-07T20:25:39.6907529Z 2025-05-07T20:25:39.6907541Z 2025-05-07T20:25:39.6907545Z 2025-05-07T20:25:39.6907549Z 2025-05-07T20:25:39.6909917Z 2025-05-07T20:25:39.7346327Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 97%  2025-05-07T20:25:39.7346734Z 2025-05-07T20:25:39.7346751Z 2025-05-07T20:25:39.7346755Z 2025-05-07T20:25:39.7346759Z 2025-05-07T20:25:39.7346763Z 2025-05-07T20:25:39.7346767Z 2025-05-07T20:25:39.7349651Z 2025-05-07T20:25:39.7505193Z libnpp-12.3.1.54 | 93.4 MB | ##4 | 24%  2025-05-07T20:25:39.8347410Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:25:39.8347704Z 2025-05-07T20:25:39.8347710Z 2025-05-07T20:25:39.8347715Z 2025-05-07T20:25:39.8347737Z 2025-05-07T20:25:39.8347754Z 2025-05-07T20:25:39.8347760Z 2025-05-07T20:25:39.8350796Z 2025-05-07T20:25:39.8506109Z libnpp-12.3.1.54 | 93.4 MB | ##7 | 28%  2025-05-07T20:25:39.9348981Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:39.9349326Z 2025-05-07T20:25:39.9349330Z 2025-05-07T20:25:39.9349334Z 2025-05-07T20:25:39.9349337Z 2025-05-07T20:25:39.9349341Z 2025-05-07T20:25:39.9349352Z 2025-05-07T20:25:39.9352610Z 2025-05-07T20:25:39.9524703Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 31%  2025-05-07T20:25:40.0353884Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:40.0354245Z 2025-05-07T20:25:40.0354252Z 2025-05-07T20:25:40.0354257Z 2025-05-07T20:25:40.0354262Z 2025-05-07T20:25:40.0354267Z 2025-05-07T20:25:40.0354282Z 2025-05-07T20:25:40.0355857Z 2025-05-07T20:25:40.0936951Z libnpp-12.3.1.54 | 93.4 MB | ###5 | 35%  2025-05-07T20:25:40.1356171Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:25:40.1356522Z 2025-05-07T20:25:40.1356528Z 2025-05-07T20:25:40.1356533Z 2025-05-07T20:25:40.1356538Z 2025-05-07T20:25:40.1356543Z 2025-05-07T20:25:40.1356548Z 2025-05-07T20:25:40.1358247Z 2025-05-07T20:25:40.2358049Z libnpp-12.3.1.54 | 93.4 MB | ###9 | 40%  2025-05-07T20:25:40.2358438Z 2025-05-07T20:25:40.2358443Z 2025-05-07T20:25:40.2358449Z 2025-05-07T20:25:40.2358454Z 2025-05-07T20:25:40.2358469Z 2025-05-07T20:25:40.2358475Z 2025-05-07T20:25:40.2358909Z 2025-05-07T20:25:40.2406398Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 44%  2025-05-07T20:25:40.3362144Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:40.3362514Z 2025-05-07T20:25:40.3362519Z 2025-05-07T20:25:40.3362525Z 2025-05-07T20:25:40.3362530Z 2025-05-07T20:25:40.3362535Z 2025-05-07T20:25:40.3362540Z 2025-05-07T20:25:40.3364015Z 2025-05-07T20:25:40.4363532Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:25:40.4364066Z 2025-05-07T20:25:40.4364070Z 2025-05-07T20:25:40.4364074Z 2025-05-07T20:25:40.4364078Z 2025-05-07T20:25:40.4364089Z 2025-05-07T20:25:40.4364093Z 2025-05-07T20:25:40.4370127Z 2025-05-07T20:25:40.4424209Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 53%  2025-05-07T20:25:40.5428810Z nsight-compute-2024. 
| 443.1 MB | #######2 | 73% 2025-05-07T20:25:40.5504473Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:40.5504840Z 2025-05-07T20:25:40.5504878Z 2025-05-07T20:25:40.5504884Z 2025-05-07T20:25:40.5504889Z 2025-05-07T20:25:40.5504894Z 2025-05-07T20:25:40.5504899Z 2025-05-07T20:25:40.5504904Z 2025-05-07T20:25:40.6433004Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 57%  2025-05-07T20:25:40.6667493Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:25:40.6667902Z 2025-05-07T20:25:40.6667908Z 2025-05-07T20:25:40.6667914Z 2025-05-07T20:25:40.6667935Z 2025-05-07T20:25:40.6667973Z 2025-05-07T20:25:40.6667977Z 2025-05-07T20:25:40.6669253Z 2025-05-07T20:25:40.7673174Z libnpp-12.3.1.54 | 93.4 MB | ######1 | 61%  2025-05-07T20:25:40.7673565Z 2025-05-07T20:25:40.7673569Z 2025-05-07T20:25:40.7673573Z 2025-05-07T20:25:40.7673577Z 2025-05-07T20:25:40.7673581Z 2025-05-07T20:25:40.7673585Z 2025-05-07T20:25:40.7674790Z 2025-05-07T20:25:40.7749596Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 66%  2025-05-07T20:25:40.8710585Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:40.8710962Z 2025-05-07T20:25:40.8710968Z 2025-05-07T20:25:40.8710973Z 2025-05-07T20:25:40.8710989Z 2025-05-07T20:25:40.8710994Z 2025-05-07T20:25:40.8710999Z 2025-05-07T20:25:40.8711005Z 2025-05-07T20:25:40.8753953Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 70%  2025-05-07T20:25:40.9710778Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:25:40.9711056Z 2025-05-07T20:25:40.9711080Z 2025-05-07T20:25:40.9711084Z 2025-05-07T20:25:40.9711087Z 2025-05-07T20:25:40.9711091Z 2025-05-07T20:25:40.9711096Z 2025-05-07T20:25:40.9711100Z 2025-05-07T20:25:41.0246560Z libnpp-12.3.1.54 | 93.4 MB | #######4 | 74%  2025-05-07T20:25:41.0792679Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:41.0792948Z 2025-05-07T20:25:41.0792953Z 2025-05-07T20:25:41.0792956Z 2025-05-07T20:25:41.0792960Z 2025-05-07T20:25:41.0792964Z 2025-05-07T20:25:41.0792994Z 2025-05-07T20:25:41.0797872Z 2025-05-07T20:25:41.1796023Z libnpp-12.3.1.54 | 93.4 MB | #######8 | 79%  2025-05-07T20:25:41.1796471Z 2025-05-07T20:25:41.1796488Z 2025-05-07T20:25:41.1796495Z 2025-05-07T20:25:41.1796501Z 2025-05-07T20:25:41.1796508Z 2025-05-07T20:25:41.1796515Z 2025-05-07T20:25:41.1796520Z 2025-05-07T20:25:41.2797645Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 83%  2025-05-07T20:25:41.2797967Z 2025-05-07T20:25:41.2798248Z 2025-05-07T20:25:41.2798253Z 2025-05-07T20:25:41.2798257Z 2025-05-07T20:25:41.2798261Z 2025-05-07T20:25:41.2798264Z 2025-05-07T20:25:41.2798297Z 2025-05-07T20:25:41.3804703Z libnpp-12.3.1.54 | 93.4 MB | ########8 | 89%  2025-05-07T20:25:41.3805082Z 2025-05-07T20:25:41.3805091Z 2025-05-07T20:25:41.3805098Z 2025-05-07T20:25:41.3805108Z 2025-05-07T20:25:41.3805132Z 2025-05-07T20:25:41.3805138Z 2025-05-07T20:25:41.3805144Z 2025-05-07T20:25:41.4120964Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 93%  2025-05-07T20:25:41.5011942Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:25:41.5012232Z 2025-05-07T20:25:41.5012237Z 2025-05-07T20:25:41.5012240Z 2025-05-07T20:25:41.5012244Z 2025-05-07T20:25:41.5012248Z 2025-05-07T20:25:41.5012253Z 2025-05-07T20:25:41.5012447Z 2025-05-07T20:25:41.5597027Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 98%  2025-05-07T20:25:41.6600816Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:25:41.7603570Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:41.8604642Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:25:41.9604825Z nsight-compute-2024. 
| 443.1 MB | #######9 | 80% 2025-05-07T20:25:42.0607119Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:42.1613435Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:42.2626085Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:42.3666882Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:25:42.4681254Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:42.5389158Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:42.5389538Z 2025-05-07T20:25:42.5389543Z 2025-05-07T20:25:42.5389549Z 2025-05-07T20:25:42.5389554Z 2025-05-07T20:25:42.5389570Z 2025-05-07T20:25:42.5398735Z 2025-05-07T20:25:42.5732736Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:42.6029454Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:25:42.6029824Z 2025-05-07T20:25:42.6029830Z 2025-05-07T20:25:42.6029836Z 2025-05-07T20:25:42.6029840Z 2025-05-07T20:25:42.6029847Z 2025-05-07T20:25:42.6029852Z 2025-05-07T20:25:42.6029857Z 2025-05-07T20:25:42.6031151Z 2025-05-07T20:25:42.6120618Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:42.6120936Z 2025-05-07T20:25:42.6120966Z 2025-05-07T20:25:42.6120970Z 2025-05-07T20:25:42.6120974Z 2025-05-07T20:25:42.6125323Z 2025-05-07T20:25:42.6631645Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:42.6631944Z 2025-05-07T20:25:42.6631949Z 2025-05-07T20:25:42.6631952Z 2025-05-07T20:25:42.6631956Z 2025-05-07T20:25:42.6631960Z 2025-05-07T20:25:42.6631964Z 2025-05-07T20:25:42.6631968Z 2025-05-07T20:25:42.6631972Z 2025-05-07T20:25:42.6632715Z 2025-05-07T20:25:42.6893363Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:42.7031933Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:42.7032324Z 2025-05-07T20:25:42.7032330Z 2025-05-07T20:25:42.7032335Z 2025-05-07T20:25:42.7032340Z 2025-05-07T20:25:42.7032346Z 2025-05-07T20:25:42.7032351Z 2025-05-07T20:25:42.7032356Z 2025-05-07T20:25:42.7039893Z 2025-05-07T20:25:42.7631418Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:25:42.7631799Z 2025-05-07T20:25:42.7631805Z 2025-05-07T20:25:42.7631811Z 2025-05-07T20:25:42.7631816Z 2025-05-07T20:25:42.7631821Z 2025-05-07T20:25:42.7631826Z 2025-05-07T20:25:42.7631831Z 2025-05-07T20:25:42.7631836Z 2025-05-07T20:25:42.7633363Z 2025-05-07T20:25:42.8125211Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:25:42.8125536Z 2025-05-07T20:25:42.8125540Z 2025-05-07T20:25:42.8125544Z 2025-05-07T20:25:42.8125557Z 2025-05-07T20:25:42.8125881Z 2025-05-07T20:25:42.8125890Z 2025-05-07T20:25:42.8125895Z 2025-05-07T20:25:42.8125901Z 2025-05-07T20:25:42.8275872Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 13%  2025-05-07T20:25:42.8671720Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:42.8672071Z 2025-05-07T20:25:42.8672076Z 2025-05-07T20:25:42.8672081Z 2025-05-07T20:25:42.8672087Z 2025-05-07T20:25:42.8672092Z 2025-05-07T20:25:42.8672097Z 2025-05-07T20:25:42.8672110Z 2025-05-07T20:25:42.8672428Z 2025-05-07T20:25:42.8673122Z 2025-05-07T20:25:42.9197772Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:25:42.9198180Z 2025-05-07T20:25:42.9198198Z 2025-05-07T20:25:42.9198203Z 2025-05-07T20:25:42.9198208Z 2025-05-07T20:25:42.9198213Z 2025-05-07T20:25:42.9198218Z 2025-05-07T20:25:42.9198223Z 2025-05-07T20:25:42.9200676Z 2025-05-07T20:25:42.9437036Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:25:42.9671845Z nsight-compute-2024. 
| 443.1 MB | ########8 | 88% 2025-05-07T20:25:42.9672130Z 2025-05-07T20:25:42.9672134Z 2025-05-07T20:25:42.9672138Z 2025-05-07T20:25:42.9672142Z 2025-05-07T20:25:42.9672153Z 2025-05-07T20:25:42.9672157Z 2025-05-07T20:25:42.9672160Z 2025-05-07T20:25:42.9672164Z 2025-05-07T20:25:42.9673472Z 2025-05-07T20:25:43.0260569Z libcurand-10.3.7.77 | 39.9 MB | ##1 | 22%  2025-05-07T20:25:43.0260914Z 2025-05-07T20:25:43.0260919Z 2025-05-07T20:25:43.0260954Z 2025-05-07T20:25:43.0260966Z 2025-05-07T20:25:43.0260969Z 2025-05-07T20:25:43.0260973Z 2025-05-07T20:25:43.0260977Z 2025-05-07T20:25:43.0260981Z 2025-05-07T20:25:43.0630949Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 26%  2025-05-07T20:25:43.0672477Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:25:43.0672803Z 2025-05-07T20:25:43.0672818Z 2025-05-07T20:25:43.0672822Z 2025-05-07T20:25:43.0672826Z 2025-05-07T20:25:43.0672928Z 2025-05-07T20:25:43.0672933Z 2025-05-07T20:25:43.0672936Z 2025-05-07T20:25:43.0672940Z 2025-05-07T20:25:43.0672943Z 2025-05-07T20:25:43.1334749Z libcurand-10.3.7.77 | 39.9 MB | ##8 | 29%  2025-05-07T20:25:43.1335074Z 2025-05-07T20:25:43.1335078Z 2025-05-07T20:25:43.1335082Z 2025-05-07T20:25:43.1335086Z 2025-05-07T20:25:43.1335090Z 2025-05-07T20:25:43.1335094Z 2025-05-07T20:25:43.1335098Z 2025-05-07T20:25:43.1335928Z 2025-05-07T20:25:43.1632355Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###1 | 31%  2025-05-07T20:25:43.1674510Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:43.1674863Z 2025-05-07T20:25:43.1674867Z 2025-05-07T20:25:43.1674871Z 2025-05-07T20:25:43.1674874Z 2025-05-07T20:25:43.1674878Z 2025-05-07T20:25:43.1674882Z 2025-05-07T20:25:43.1674888Z 2025-05-07T20:25:43.1674892Z 2025-05-07T20:25:43.1674896Z 2025-05-07T20:25:43.2396211Z libcurand-10.3.7.77 | 39.9 MB | ###6 | 37%  2025-05-07T20:25:43.2396630Z 2025-05-07T20:25:43.2396635Z 2025-05-07T20:25:43.2396640Z 2025-05-07T20:25:43.2396645Z 2025-05-07T20:25:43.2396651Z 2025-05-07T20:25:43.2396656Z 2025-05-07T20:25:43.2396662Z 2025-05-07T20:25:43.2398643Z 2025-05-07T20:25:43.2670140Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###7 | 37%  2025-05-07T20:25:43.2679040Z nsight-compute-2024. | 443.1 MB | ######### | 90% 2025-05-07T20:25:43.2679757Z 2025-05-07T20:25:43.2679763Z 2025-05-07T20:25:43.2679792Z 2025-05-07T20:25:43.2679798Z 2025-05-07T20:25:43.2679803Z 2025-05-07T20:25:43.2679808Z 2025-05-07T20:25:43.2679813Z 2025-05-07T20:25:43.2679819Z 2025-05-07T20:25:43.2679843Z 2025-05-07T20:25:43.3399377Z libcurand-10.3.7.77 | 39.9 MB | ####4 | 45%  2025-05-07T20:25:43.3399842Z 2025-05-07T20:25:43.3399848Z 2025-05-07T20:25:43.3399854Z 2025-05-07T20:25:43.3399859Z 2025-05-07T20:25:43.3399865Z 2025-05-07T20:25:43.3400178Z 2025-05-07T20:25:43.3400187Z 2025-05-07T20:25:43.3401980Z 2025-05-07T20:25:43.3686879Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####3 | 43%  2025-05-07T20:25:43.3691860Z nsight-compute-2024. 
| 443.1 MB | ######### | 91% 2025-05-07T20:25:43.3692118Z 2025-05-07T20:25:43.3692122Z 2025-05-07T20:25:43.3692126Z 2025-05-07T20:25:43.3692134Z 2025-05-07T20:25:43.3692139Z 2025-05-07T20:25:43.3692145Z 2025-05-07T20:25:43.3692150Z 2025-05-07T20:25:43.3692166Z 2025-05-07T20:25:43.3693646Z 2025-05-07T20:25:43.4405782Z libcurand-10.3.7.77 | 39.9 MB | #####2 | 52%  2025-05-07T20:25:43.4406142Z 2025-05-07T20:25:43.4406147Z 2025-05-07T20:25:43.4406151Z 2025-05-07T20:25:43.4406155Z 2025-05-07T20:25:43.4406166Z 2025-05-07T20:25:43.4406170Z 2025-05-07T20:25:43.4406173Z 2025-05-07T20:25:43.4407469Z 2025-05-07T20:25:43.4687880Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##### | 50%  2025-05-07T20:25:43.4826054Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:25:43.4826404Z 2025-05-07T20:25:43.4826408Z 2025-05-07T20:25:43.4826412Z 2025-05-07T20:25:43.4826416Z 2025-05-07T20:25:43.4826419Z 2025-05-07T20:25:43.4826423Z 2025-05-07T20:25:43.4826435Z 2025-05-07T20:25:43.4826439Z 2025-05-07T20:25:43.4830319Z 2025-05-07T20:25:43.5475490Z libcurand-10.3.7.77 | 39.9 MB | #####9 | 60%  2025-05-07T20:25:43.5475860Z 2025-05-07T20:25:43.5475878Z 2025-05-07T20:25:43.5475884Z 2025-05-07T20:25:43.5475923Z 2025-05-07T20:25:43.5475929Z 2025-05-07T20:25:43.5475934Z 2025-05-07T20:25:43.5475940Z 2025-05-07T20:25:43.5477740Z 2025-05-07T20:25:43.5687727Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####6 | 56%  2025-05-07T20:25:43.5831105Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:43.5831591Z 2025-05-07T20:25:43.5831597Z 2025-05-07T20:25:43.5831612Z 2025-05-07T20:25:43.5831618Z 2025-05-07T20:25:43.5831623Z 2025-05-07T20:25:43.5831653Z 2025-05-07T20:25:43.5831661Z 2025-05-07T20:25:43.5831667Z 2025-05-07T20:25:43.5834918Z 2025-05-07T20:25:43.6475949Z libcurand-10.3.7.77 | 39.9 MB | ######7 | 67%  2025-05-07T20:25:43.6476280Z 2025-05-07T20:25:43.6476284Z 2025-05-07T20:25:43.6476287Z 2025-05-07T20:25:43.6476291Z 2025-05-07T20:25:43.6476295Z 2025-05-07T20:25:43.6476298Z 2025-05-07T20:25:43.6476303Z 2025-05-07T20:25:43.6476309Z 2025-05-07T20:25:43.6688199Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######2 | 63%  2025-05-07T20:25:43.6899802Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:25:43.6900117Z 2025-05-07T20:25:43.6900121Z 2025-05-07T20:25:43.6900125Z 2025-05-07T20:25:43.6900129Z 2025-05-07T20:25:43.6900132Z 2025-05-07T20:25:43.6900136Z 2025-05-07T20:25:43.6900140Z 2025-05-07T20:25:43.6900144Z 2025-05-07T20:25:43.6902587Z 2025-05-07T20:25:43.7482681Z libcurand-10.3.7.77 | 39.9 MB | #######4 | 75%  2025-05-07T20:25:43.7483199Z 2025-05-07T20:25:43.7483206Z 2025-05-07T20:25:43.7483211Z 2025-05-07T20:25:43.7483216Z 2025-05-07T20:25:43.7483221Z 2025-05-07T20:25:43.7483227Z 2025-05-07T20:25:43.7483235Z 2025-05-07T20:25:43.7483242Z 2025-05-07T20:25:43.7694707Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######9 | 69%  2025-05-07T20:25:43.7902386Z nsight-compute-2024. 
2025-05-07T20:25:43.7902687Z [conda package download progress bars condensed; interleaved carriage-return redraws and terminal control characters removed — final state of each transfer follows]
2025-05-07T20:25:44.2713827Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.9051974Z libnpp-12.3.1.54     | 93.4 MB  | ########## | 100%
2025-05-07T20:25:45.4148850Z libcublas-12.6.4.1   | 256.2 MB | ########## | 100%
2025-05-07T20:25:45.5029518Z libcurand-10.3.7.77  | 39.9 MB  | ########## | 100%
2025-05-07T20:25:45.9626232Z cuda-nvdisasm-12.6.7 | 47.6 MB  | ########## | 100%
2025-05-07T20:25:47.2120017Z libcufft-11.3.0.4    | 156.2 MB | ########## | 100%
2025-05-07T20:25:47.2214085Z cuda-nvcc-tools-12.6 | 23.0 MB  | ########## | 100%
2025-05-07T20:25:47.2634235Z cuda-nvrtc-12.6.85   | 17.3 MB  | ########## | 100%
2025-05-07T20:25:47.4634805Z gds-tools-1.11.1.6   | 37.8 MB  | ########## | 100%
2025-05-07T20:25:47.4842668Z python-3.11.8        | 29.3 MB  | ########## | 100%
2025-05-07T20:25:48.1045986Z cuda-nvcc-dev_linux- | 10.8 MB  | ########## | 100%
2025-05-07T20:25:48.1509098Z cuda-sanitizer-api-1 | 8.9 MB   | ########## | 100%
2025-05-07T20:25:48.1732021Z cuda-nvvm-tools-12.6 | 10.4 MB  | ########## | 100%
2025-05-07T20:25:48.2048181Z ... (more hidden) ...
2025-05-07T20:25:48.2738133Z libnvjitlink-12.6.85 | 14.9 MB  | ########## | 100%
2025-05-07T20:25:49.4270080Z cuda-nvvm-impl-12.6. | 7.7 MB   | ########## | 100%
2025-05-07T20:25:50.2670689Z libcusolver-11.7.1.2 | 95.8 MB  | ########## | 100%
2025-05-07T20:25:50.8855659Z cuda-nvvp-12.6.80    | 109.3 MB | ########## | 100%
2025-05-07T20:25:59.2090234Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:59.2160811Z 2025-05-07T20:25:59.2160950Z  2025-05-07T20:25:59.2161095Z 2025-05-07T20:25:59.2161100Z 2025-05-07T20:25:59.2161106Z 2025-05-07T20:25:59.2161244Z  2025-05-07T20:25:59.2161395Z 2025-05-07T20:25:59.2161400Z 2025-05-07T20:25:59.2161405Z 2025-05-07T20:25:59.2161517Z 2025-05-07T20:25:59.2161665Z  2025-05-07T20:25:59.2161828Z 2025-05-07T20:25:59.2161833Z 2025-05-07T20:25:59.2161838Z 2025-05-07T20:25:59.2161843Z 2025-05-07T20:25:59.2161848Z 2025-05-07T20:25:59.2161991Z  2025-05-07T20:25:59.2162152Z 2025-05-07T20:25:59.2162165Z 2025-05-07T20:25:59.2162171Z 2025-05-07T20:25:59.2162176Z 2025-05-07T20:25:59.2162181Z 2025-05-07T20:25:59.2162186Z 2025-05-07T20:25:59.2162333Z  2025-05-07T20:25:59.2162500Z 2025-05-07T20:25:59.2162599Z 2025-05-07T20:25:59.2162604Z 2025-05-07T20:25:59.2162618Z 2025-05-07T20:25:59.2162622Z 2025-05-07T20:25:59.2162627Z 2025-05-07T20:25:59.2162632Z 2025-05-07T20:25:59.2162791Z  2025-05-07T20:25:59.2162978Z 2025-05-07T20:25:59.2162984Z 2025-05-07T20:25:59.2162989Z 2025-05-07T20:25:59.2162994Z 2025-05-07T20:25:59.2163007Z 2025-05-07T20:25:59.2163012Z 2025-05-07T20:25:59.2163017Z 2025-05-07T20:25:59.2163022Z 2025-05-07T20:25:59.2163193Z  2025-05-07T20:25:59.2163392Z 2025-05-07T20:25:59.2163397Z 2025-05-07T20:25:59.2163402Z 2025-05-07T20:25:59.2163415Z 2025-05-07T20:25:59.2163421Z 2025-05-07T20:25:59.2163426Z 2025-05-07T20:25:59.2163431Z 2025-05-07T20:25:59.2163436Z 2025-05-07T20:25:59.2163442Z 2025-05-07T20:25:59.2163605Z  2025-05-07T20:25:59.2163956Z 2025-05-07T20:25:59.2163962Z 2025-05-07T20:25:59.2163973Z 2025-05-07T20:25:59.2163977Z 2025-05-07T20:25:59.2163981Z 2025-05-07T20:25:59.2163984Z 2025-05-07T20:25:59.2163996Z 2025-05-07T20:25:59.2164000Z 2025-05-07T20:25:59.2164003Z 2025-05-07T20:25:59.2164007Z 2025-05-07T20:25:59.2164139Z  2025-05-07T20:25:59.2164308Z 2025-05-07T20:25:59.2164312Z 2025-05-07T20:25:59.2164316Z 2025-05-07T20:25:59.2164320Z 2025-05-07T20:25:59.2164323Z 2025-05-07T20:25:59.2164327Z 2025-05-07T20:25:59.2164331Z 2025-05-07T20:25:59.2164335Z 2025-05-07T20:25:59.2164338Z 2025-05-07T20:25:59.2164342Z 2025-05-07T20:25:59.2164346Z 2025-05-07T20:25:59.2164478Z  2025-05-07T20:25:59.2164653Z 2025-05-07T20:25:59.2164657Z 2025-05-07T20:25:59.2164660Z 2025-05-07T20:25:59.2164664Z 2025-05-07T20:25:59.2164668Z 2025-05-07T20:25:59.2164671Z 2025-05-07T20:25:59.2164675Z 2025-05-07T20:25:59.2164679Z 2025-05-07T20:25:59.2164682Z 2025-05-07T20:25:59.2164686Z 2025-05-07T20:25:59.2164690Z 2025-05-07T20:25:59.2164693Z 2025-05-07T20:25:59.2164829Z  2025-05-07T20:25:59.2165005Z 2025-05-07T20:25:59.2165013Z 2025-05-07T20:25:59.2165017Z 2025-05-07T20:25:59.2165021Z 2025-05-07T20:25:59.2165024Z 2025-05-07T20:25:59.2165028Z 2025-05-07T20:25:59.2165031Z 2025-05-07T20:25:59.2165035Z 2025-05-07T20:25:59.2165038Z 2025-05-07T20:25:59.2165042Z 2025-05-07T20:25:59.2165045Z 2025-05-07T20:25:59.2165049Z 2025-05-07T20:25:59.2165053Z 2025-05-07T20:25:59.2165188Z  2025-05-07T20:25:59.2165368Z 2025-05-07T20:25:59.2165372Z 2025-05-07T20:25:59.2165380Z 2025-05-07T20:25:59.2165383Z 2025-05-07T20:25:59.2165387Z 2025-05-07T20:25:59.2165391Z 2025-05-07T20:25:59.2165394Z 2025-05-07T20:25:59.2165398Z 2025-05-07T20:25:59.2165401Z 2025-05-07T20:25:59.2165410Z 2025-05-07T20:25:59.2165413Z 2025-05-07T20:25:59.2165417Z 2025-05-07T20:25:59.2165421Z 2025-05-07T20:25:59.2165424Z 2025-05-07T20:25:59.2165606Z  2025-05-07T20:25:59.2165793Z 2025-05-07T20:25:59.2165804Z 2025-05-07T20:25:59.2165807Z 2025-05-07T20:25:59.2165814Z 2025-05-07T20:25:59.2165818Z 2025-05-07T20:25:59.2165821Z 
2025-05-07T20:25:59.2165825Z 2025-05-07T20:25:59.2165828Z 2025-05-07T20:25:59.2165832Z 2025-05-07T20:25:59.2165836Z 2025-05-07T20:25:59.2165839Z 2025-05-07T20:25:59.2165843Z 2025-05-07T20:25:59.2165846Z 2025-05-07T20:25:59.2165850Z 2025-05-07T20:25:59.2165853Z 2025-05-07T20:25:59.2165996Z  2025-05-07T20:25:59.2166194Z 2025-05-07T20:25:59.2166198Z 2025-05-07T20:25:59.2166311Z 2025-05-07T20:25:59.2166315Z 2025-05-07T20:25:59.2166319Z 2025-05-07T20:25:59.2166323Z 2025-05-07T20:25:59.2166326Z 2025-05-07T20:25:59.2166330Z 2025-05-07T20:25:59.2166333Z 2025-05-07T20:25:59.2166337Z 2025-05-07T20:25:59.2166340Z 2025-05-07T20:25:59.2166344Z 2025-05-07T20:25:59.2166348Z 2025-05-07T20:25:59.2166351Z 2025-05-07T20:25:59.2166355Z 2025-05-07T20:25:59.2166358Z 2025-05-07T20:25:59.2166512Z  2025-05-07T20:25:59.2166764Z 2025-05-07T20:25:59.2166869Z 2025-05-07T20:25:59.2166875Z 2025-05-07T20:25:59.2166880Z 2025-05-07T20:25:59.2166885Z 2025-05-07T20:25:59.2166890Z 2025-05-07T20:25:59.2166895Z 2025-05-07T20:25:59.2166901Z 2025-05-07T20:25:59.2166906Z 2025-05-07T20:25:59.2166920Z 2025-05-07T20:25:59.2166925Z 2025-05-07T20:25:59.2166930Z 2025-05-07T20:25:59.2166935Z 2025-05-07T20:25:59.2166940Z 2025-05-07T20:25:59.2166945Z 2025-05-07T20:25:59.2166950Z 2025-05-07T20:25:59.2166956Z 2025-05-07T20:25:59.2167206Z  2025-05-07T20:25:59.2167494Z 2025-05-07T20:25:59.2167500Z 2025-05-07T20:25:59.2167504Z 2025-05-07T20:25:59.2167510Z 2025-05-07T20:25:59.2167514Z 2025-05-07T20:25:59.2167520Z 2025-05-07T20:25:59.2167525Z 2025-05-07T20:25:59.2167530Z 2025-05-07T20:25:59.2167535Z 2025-05-07T20:25:59.2167540Z 2025-05-07T20:25:59.2167545Z 2025-05-07T20:25:59.2167550Z 2025-05-07T20:25:59.2167555Z 2025-05-07T20:25:59.2167560Z 2025-05-07T20:25:59.2167565Z 2025-05-07T20:25:59.2167570Z 2025-05-07T20:25:59.2167583Z 2025-05-07T20:25:59.2167588Z 2025-05-07T20:25:59.2167814Z  2025-05-07T20:25:59.2168083Z 2025-05-07T20:25:59.2168086Z 2025-05-07T20:25:59.2168189Z  2025-05-07T20:25:59.2168297Z 2025-05-07T20:25:59.2168301Z 2025-05-07T20:25:59.2168402Z  2025-05-07T20:25:59.2168514Z 2025-05-07T20:25:59.2168518Z 2025-05-07T20:25:59.2168522Z 2025-05-07T20:25:59.2168621Z  2025-05-07T20:25:59.2168725Z 2025-05-07T20:25:59.2168735Z 2025-05-07T20:25:59.2168739Z 2025-05-07T20:25:59.2168743Z 2025-05-07T20:25:59.2168855Z  2025-05-07T20:25:59.2168968Z 2025-05-07T20:25:59.2168972Z 2025-05-07T20:25:59.2168976Z 2025-05-07T20:25:59.2168979Z 2025-05-07T20:25:59.2168983Z 2025-05-07T20:25:59.2169098Z  2025-05-07T20:25:59.2169218Z 2025-05-07T20:25:59.2169221Z 2025-05-07T20:25:59.2169225Z 2025-05-07T20:25:59.2169228Z 2025-05-07T20:25:59.2169232Z 2025-05-07T20:25:59.2169236Z 2025-05-07T20:25:59.2169354Z  2025-05-07T20:25:59.2169484Z 2025-05-07T20:25:59.2169487Z 2025-05-07T20:25:59.2169491Z 2025-05-07T20:25:59.2169494Z 2025-05-07T20:25:59.2169498Z 2025-05-07T20:25:59.2169502Z 2025-05-07T20:25:59.2169505Z 2025-05-07T20:25:59.2169623Z  2025-05-07T20:25:59.2169760Z 2025-05-07T20:25:59.2169764Z 2025-05-07T20:25:59.2169768Z 2025-05-07T20:25:59.2169771Z 2025-05-07T20:25:59.2169775Z 2025-05-07T20:25:59.2169778Z 2025-05-07T20:25:59.2169782Z 2025-05-07T20:25:59.2169790Z 2025-05-07T20:25:59.2169915Z  2025-05-07T20:25:59.2170058Z 2025-05-07T20:25:59.2170062Z 2025-05-07T20:25:59.2170065Z 2025-05-07T20:25:59.2170069Z 2025-05-07T20:25:59.2170072Z 2025-05-07T20:25:59.2170076Z 2025-05-07T20:25:59.2170079Z 2025-05-07T20:25:59.2170083Z 2025-05-07T20:25:59.2170087Z 2025-05-07T20:25:59.2170210Z  2025-05-07T20:25:59.2170363Z 2025-05-07T20:25:59.2170366Z 2025-05-07T20:25:59.2170370Z 
2025-05-07T20:25:59.2170374Z 2025-05-07T20:25:59.2170381Z 2025-05-07T20:25:59.2170385Z 2025-05-07T20:25:59.2170388Z 2025-05-07T20:25:59.2170392Z 2025-05-07T20:25:59.2170395Z 2025-05-07T20:25:59.2170399Z 2025-05-07T20:25:59.2170529Z  2025-05-07T20:25:59.2170687Z 2025-05-07T20:25:59.2170691Z 2025-05-07T20:25:59.2170694Z 2025-05-07T20:25:59.2170698Z 2025-05-07T20:25:59.2170701Z 2025-05-07T20:25:59.2170705Z 2025-05-07T20:25:59.2170709Z 2025-05-07T20:25:59.2170712Z 2025-05-07T20:25:59.2170833Z 2025-05-07T20:25:59.2170846Z 2025-05-07T20:25:59.2170850Z 2025-05-07T20:25:59.2170978Z  2025-05-07T20:25:59.2171148Z 2025-05-07T20:25:59.2171151Z 2025-05-07T20:25:59.2171155Z 2025-05-07T20:25:59.2171158Z 2025-05-07T20:25:59.2171162Z 2025-05-07T20:25:59.2171166Z 2025-05-07T20:25:59.2171175Z 2025-05-07T20:25:59.2171179Z 2025-05-07T20:25:59.2171182Z 2025-05-07T20:25:59.2171186Z 2025-05-07T20:25:59.2171189Z 2025-05-07T20:25:59.2171193Z 2025-05-07T20:25:59.2171323Z  2025-05-07T20:25:59.2171576Z 2025-05-07T20:25:59.2171585Z 2025-05-07T20:25:59.2171589Z 2025-05-07T20:25:59.2171592Z 2025-05-07T20:25:59.2171596Z 2025-05-07T20:25:59.2171599Z 2025-05-07T20:25:59.2171603Z 2025-05-07T20:25:59.2171607Z 2025-05-07T20:25:59.2171610Z 2025-05-07T20:25:59.2171614Z 2025-05-07T20:25:59.2171617Z 2025-05-07T20:25:59.2171621Z 2025-05-07T20:25:59.2171624Z 2025-05-07T20:25:59.2171758Z  2025-05-07T20:25:59.2171952Z 2025-05-07T20:25:59.2171956Z 2025-05-07T20:25:59.2171960Z 2025-05-07T20:25:59.2171964Z 2025-05-07T20:25:59.2171967Z 2025-05-07T20:25:59.2171971Z 2025-05-07T20:25:59.2171974Z 2025-05-07T20:25:59.2171978Z 2025-05-07T20:25:59.2171981Z 2025-05-07T20:25:59.2171985Z 2025-05-07T20:25:59.2171988Z 2025-05-07T20:25:59.2171992Z 2025-05-07T20:25:59.2171996Z 2025-05-07T20:25:59.2171999Z 2025-05-07T20:25:59.2172145Z  2025-05-07T20:25:59.2172334Z 2025-05-07T20:25:59.2172342Z 2025-05-07T20:25:59.2172346Z 2025-05-07T20:25:59.2172349Z 2025-05-07T20:25:59.2172353Z 2025-05-07T20:25:59.2172357Z 2025-05-07T20:25:59.2172360Z 2025-05-07T20:25:59.2172364Z 2025-05-07T20:25:59.2172367Z 2025-05-07T20:25:59.2172371Z 2025-05-07T20:25:59.2172375Z 2025-05-07T20:25:59.2172378Z 2025-05-07T20:25:59.2172382Z 2025-05-07T20:25:59.2172392Z 2025-05-07T20:25:59.2172396Z 2025-05-07T20:25:59.2172538Z  2025-05-07T20:25:59.2172735Z 2025-05-07T20:25:59.2172739Z 2025-05-07T20:25:59.2172742Z 2025-05-07T20:25:59.2172746Z 2025-05-07T20:25:59.2172749Z 2025-05-07T20:25:59.2172760Z 2025-05-07T20:25:59.2172764Z 2025-05-07T20:25:59.2172768Z 2025-05-07T20:25:59.2172771Z 2025-05-07T20:25:59.2172775Z 2025-05-07T20:25:59.2172779Z 2025-05-07T20:25:59.2172783Z 2025-05-07T20:25:59.2172787Z 2025-05-07T20:25:59.2172790Z 2025-05-07T20:25:59.2172794Z 2025-05-07T20:25:59.2172798Z 2025-05-07T20:25:59.2172945Z  2025-05-07T20:25:59.2173152Z 2025-05-07T20:25:59.2173156Z 2025-05-07T20:25:59.2173160Z 2025-05-07T20:25:59.2173163Z 2025-05-07T20:25:59.2173167Z 2025-05-07T20:25:59.2173171Z 2025-05-07T20:25:59.2173175Z 2025-05-07T20:25:59.2173178Z 2025-05-07T20:25:59.2173182Z 2025-05-07T20:25:59.2173185Z 2025-05-07T20:25:59.2173189Z 2025-05-07T20:25:59.2173193Z 2025-05-07T20:25:59.2173196Z 2025-05-07T20:25:59.2173200Z 2025-05-07T20:25:59.2173203Z 2025-05-07T20:25:59.2173207Z 2025-05-07T20:25:59.2173215Z 2025-05-07T20:25:59.2173369Z  2025-05-07T20:25:59.2173571Z 2025-05-07T20:25:59.2173575Z 2025-05-07T20:25:59.2173578Z 2025-05-07T20:25:59.2173582Z 2025-05-07T20:25:59.2173586Z 2025-05-07T20:25:59.2173589Z 2025-05-07T20:25:59.2173593Z 2025-05-07T20:25:59.2173596Z 2025-05-07T20:25:59.2173600Z 
2025-05-07T20:25:59.2173604Z 2025-05-07T20:25:59.2173616Z 2025-05-07T20:25:59.2173619Z 2025-05-07T20:25:59.2173623Z 2025-05-07T20:25:59.2173627Z 2025-05-07T20:25:59.2173634Z 2025-05-07T20:25:59.2173638Z 2025-05-07T20:25:59.2173642Z 2025-05-07T20:25:59.2173646Z 2025-05-07T20:25:59.2173810Z  2025-05-07T20:25:59.2174024Z 2025-05-07T20:25:59.2174027Z 2025-05-07T20:25:59.2174122Z  2025-05-07T20:25:59.2174221Z 2025-05-07T20:25:59.2174225Z 2025-05-07T20:25:59.2174330Z  2025-05-07T20:25:59.2174434Z 2025-05-07T20:25:59.2174438Z 2025-05-07T20:25:59.2174441Z 2025-05-07T20:25:59.2174624Z  2025-05-07T20:25:59.2174738Z 2025-05-07T20:25:59.2174741Z 2025-05-07T20:25:59.2174745Z 2025-05-07T20:25:59.2174749Z 2025-05-07T20:25:59.2174854Z  2025-05-07T20:25:59.2174970Z 2025-05-07T20:25:59.2174973Z 2025-05-07T20:25:59.2174977Z 2025-05-07T20:25:59.2174981Z 2025-05-07T20:25:59.2174985Z 2025-05-07T20:25:59.2175092Z  2025-05-07T20:25:59.2175214Z 2025-05-07T20:25:59.2175218Z 2025-05-07T20:25:59.2175222Z 2025-05-07T20:25:59.2175225Z 2025-05-07T20:25:59.2175307Z 2025-05-07T20:25:59.2175311Z 2025-05-07T20:25:59.2175422Z  2025-05-07T20:25:59.2175552Z 2025-05-07T20:25:59.2175556Z 2025-05-07T20:25:59.2175559Z 2025-05-07T20:25:59.2175563Z 2025-05-07T20:25:59.2175566Z 2025-05-07T20:25:59.2175570Z 2025-05-07T20:25:59.2175574Z 2025-05-07T20:25:59.2175685Z  2025-05-07T20:25:59.2175818Z 2025-05-07T20:25:59.2175827Z 2025-05-07T20:25:59.2175831Z 2025-05-07T20:25:59.2175835Z 2025-05-07T20:25:59.2175838Z 2025-05-07T20:25:59.2175847Z 2025-05-07T20:25:59.2175851Z 2025-05-07T20:25:59.2175855Z 2025-05-07T20:25:59.2175971Z  2025-05-07T20:25:59.2176114Z 2025-05-07T20:25:59.2176124Z 2025-05-07T20:25:59.2176128Z 2025-05-07T20:25:59.2176132Z 2025-05-07T20:25:59.2176135Z 2025-05-07T20:25:59.2176139Z 2025-05-07T20:25:59.2176142Z 2025-05-07T20:25:59.2176146Z 2025-05-07T20:25:59.2176149Z 2025-05-07T20:25:59.2176269Z  2025-05-07T20:25:59.2176427Z 2025-05-07T20:25:59.2176431Z 2025-05-07T20:25:59.2176440Z 2025-05-07T20:25:59.2176444Z 2025-05-07T20:25:59.2176448Z 2025-05-07T20:25:59.2176452Z 2025-05-07T20:25:59.2176456Z 2025-05-07T20:25:59.2176459Z 2025-05-07T20:25:59.2176463Z 2025-05-07T20:25:59.2176467Z 2025-05-07T20:25:59.2176594Z  2025-05-07T20:25:59.2176760Z 2025-05-07T20:25:59.2176764Z 2025-05-07T20:25:59.2176768Z 2025-05-07T20:25:59.2176771Z 2025-05-07T20:25:59.2176775Z 2025-05-07T20:25:59.2176779Z 2025-05-07T20:25:59.2176787Z 2025-05-07T20:25:59.2176791Z 2025-05-07T20:25:59.2176794Z 2025-05-07T20:25:59.2176798Z 2025-05-07T20:25:59.2176802Z 2025-05-07T20:25:59.2176939Z  done 2025-05-07T20:25:59.5375294Z Preparing transaction: \ | / done 2025-05-07T20:26:01.1950806Z Verifying transaction: \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:26:02.1198665Z Executing transaction: | / - \ | / - \ | done 2025-05-07T20:26:04.5144536Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ... 2025-05-07T20:26:04.5145119Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:04.5145824Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:04.5146385Z 2025-05-07T20:26:04.5157042Z 2025-05-07T20:26:04.5158090Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:04.5158853Z 2025-05-07T20:26:04.5170370Z 2025-05-07T20:26:04.5170562Z [INSTALL] Copying nvtx3 headers ... 
2025-05-07T20:26:04.5176584Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:04.6785414Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:04.6808352Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:04.7177246Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:06.6084933Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:06.6752847Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:07.1055177Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:07.1402038Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:07.5816587Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:07.5817767Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:10.0723246Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:12.1107490Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:14.1649524Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:14.1650444Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:16.2153688Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:18.1379241Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:18.2032668Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:22.0978126Z /tmp/tmpdjpi5s50: line 3: clang: command not found
2025-05-07T20:26:22.0978702Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:22.1662481Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:22.1685129Z total 36
2025-05-07T20:26:22.1685627Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:22.1686032Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:22.1686917Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:22.1687691Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:22.1688405Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:22.1689125Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:22.1689780Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:22.1690414Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:22.1691337Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:22.1692233Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:22.1711266Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:24.1567407Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:24.1567996Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:24.5866205Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:26.4978698Z -allow-unsupported-compiler
2025-05-07T20:26:26.5632169Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:26.5633277Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:26.5633926Z 2025-05-07T20:26:28.5449305Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:28.5450087Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:28.5450458Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:28.5450795Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:28.5451122Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:28.5451414Z #define _STL_PAIR_H 1 2025-05-07T20:26:28.5451766Z #define __cpp_attributes 200809L 2025-05-07T20:26:28.5452214Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:28.5452689Z #define __DELETE_THROW throw() 2025-05-07T20:26:28.5453003Z #define _PTRDIFF_T_ 2025-05-07T20:26:28.5453329Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:28.5453767Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:28.5454168Z #define _IO_LEFT 02 2025-05-07T20:26:28.5454478Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:28.5454837Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:28.5455123Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:28.5455548Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:28.5455974Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:28.5456391Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:28.5456757Z #define _IOS_OUTPUT 2 2025-05-07T20:26:28.5457183Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:28.5457698Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:28.5458141Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:28.5458533Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:28.5458925Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:28.5460013Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:28.5461130Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:28.5461505Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:28.5461992Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:28.5462443Z #define _T_WCHAR_ 2025-05-07T20:26:28.5462738Z #define stdout stdout 2025-05-07T20:26:28.5463195Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:28.5463913Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:28.5464164Z #define __flexarr [] 2025-05-07T20:26:28.5464402Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:28.5464723Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:28.5465057Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:28.5465311Z #define _MATH_H 1 2025-05-07T20:26:28.5465659Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:28.5465996Z #define __S64_TYPE long int 2025-05-07T20:26:28.5466245Z #define __stub_fchflags 2025-05-07T20:26:28.5466682Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:28.5466969Z #define __SQUAD_TYPE long int 2025-05-07T20:26:28.5467229Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:28.5467490Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:28.5467741Z #define NL_NMAX INT_MAX 2025-05-07T20:26:28.5467969Z #define _BITS_TIME_H 1 2025-05-07T20:26:28.5468242Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:28.5468569Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:28.5468865Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:28.5469220Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:28.5469612Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:28.5469967Z #define __CHAR_BIT__ 8 2025-05-07T20:26:28.5471656Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5472023Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:28.5472314Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:28.5472584Z #define FP_NAN 0 2025-05-07T20:26:28.5472841Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:28.5473276Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:28.5473753Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:28.5474136Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:28.5474417Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:28.5474675Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:28.5474929Z #define __SM_80_RT_H__ 2025-05-07T20:26:28.5475152Z #define _NEW 2025-05-07T20:26:28.5475366Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:28.5475644Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:28.5476005Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:28.5476396Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:28.5476631Z #define __USE_ANSI 1 2025-05-07T20:26:28.5476911Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:28.5477298Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:28.5477650Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:28.5477954Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:28.5478227Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:28.5478501Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:28.5478777Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:28.5479057Z #define PIPE_BUF 4096 2025-05-07T20:26:28.5479376Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:28.5481116Z #define ADJ_TICK 0x4000 2025-05-07T20:26:28.5481391Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:28.5481703Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:28.5481964Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:28.5482280Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:28.5482737Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:28.5483255Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:28.5483812Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:28.5484071Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:28.5484336Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5484621Z #define __cpp_static_assert 201411L 2025-05-07T20:26:28.5484961Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:28.5485294Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:28.5485670Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:28.5485952Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:28.5486245Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:28.5486526Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:28.5486825Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5487170Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:28.5487509Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:28.5487786Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:28.5488094Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5488522Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:28.5488871Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:28.5489166Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:28.5489447Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:28.5489768Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:28.5490090Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:28.5490485Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:28.5490885Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:28.5491183Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:28.5491446Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:28.5491714Z #define __GCC_IEC_559 2 2025-05-07T20:26:28.5491999Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:28.5492331Z #define _IO_flockfile(_fp) 2025-05-07T20:26:28.5492583Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:28.5492852Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:28.5493111Z #define _IOFBF 0 2025-05-07T20:26:28.5493316Z #define __USE_BSD 1 2025-05-07T20:26:28.5493538Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:28.5493802Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:28.5494065Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:28.5494312Z #define _IO_NO_WRITES 8 2025-05-07T20:26:28.5494562Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:28.5494910Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:28.5495255Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:28.5495554Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:28.5495872Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:28.5496156Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:28.5496420Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:28.5496685Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:28.5496986Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:28.5497376Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:28.5497734Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:28.5498031Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:28.5498336Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:28.5498885Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:28.5499185Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:28.5499494Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:28.5499767Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:28.5500033Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:28.5500604Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:28.5501194Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:28.5501518Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:28.5501833Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:28.5502143Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:28.5502423Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:28.5502680Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:28.5502984Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:28.5503309Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:28.5503608Z #define RAND_MAX 2147483647 2025-05-07T20:26:28.5503869Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:28.5504289Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5504601Z #define __SM_90_RT_H__ 2025-05-07T20:26:28.5504836Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:28.5505093Z #define __COMPAR_FN_T 2025-05-07T20:26:28.5505331Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5505617Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:28.5506113Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:28.5506642Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:28.5507068Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:28.5507416Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:28.5507713Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:28.5508054Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:28.5508358Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:28.5508866Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:28.5509417Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:28.5509753Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:28.5510017Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:28.5522535Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:28.5522835Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:28.5523087Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:28.5523342Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:28.5523587Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:28.5524015Z #define __u_char_defined 2025-05-07T20:26:28.5524330Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:28.5524672Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:28.5524909Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:28.5525154Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:28.5525437Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:28.5525876Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:28.5526298Z #define FP_INFINITE 1 2025-05-07T20:26:28.5526670Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:28.5527086Z #define _IO_pid_t __pid_t 2025-05-07T20:26:28.5527345Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:28.5527603Z #define __LEAF , __leaf__ 2025-05-07T20:26:28.5527843Z #define PATH_MAX 4096 2025-05-07T20:26:28.5528079Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:28.5528412Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:28.5528734Z #define _LIMITS_H___ 2025-05-07T20:26:28.5528961Z #define __size_t 2025-05-07T20:26:28.5529182Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:28.5529720Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:28.5530284Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:28.5530581Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:28.5530913Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:28.5531176Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:28.5531524Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:28.5531931Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:28.5532230Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:28.5532550Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:28.5532835Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:28.5533108Z #define __INT8_C(c) c 2025-05-07T20:26:28.5533372Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:28.5533672Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:28.5533925Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:28.5534181Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:28.5534431Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:28.5534702Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:28.5535017Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5535338Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:28.5535601Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:28.5536069Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:28.5536362Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:28.5536673Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:28.5536972Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:28.5537326Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:28.5537699Z #define NFDBITS __NFDBITS 2025-05-07T20:26:28.5537949Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:28.5538232Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:28.5539106Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:28.5539426Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:28.5539673Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:28.5539957Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:28.5540255Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:28.5540554Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:28.5540971Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:28.5541325Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:28.5541601Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:28.5541910Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:28.5542277Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:28.5542602Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:28.5542913Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:28.5543240Z #define __daddr_t_defined 2025-05-07T20:26:28.5543493Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:28.5543752Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:28.5544057Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:28.5544557Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:28.5545023Z #define _ACRTIMP 2025-05-07T20:26:28.5545243Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:28.5545501Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:28.5545787Z #define _IOS_BIN 128 2025-05-07T20:26:28.5546147Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:28.5546592Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:28.5546847Z #define UNDERFLOW 4 2025-05-07T20:26:28.5547062Z #define NAME_MAX 255 2025-05-07T20:26:28.5547291Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:28.5547562Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:28.5547827Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:28.5548119Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:28.5548497Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:28.5548873Z #define __ptr_t void * 2025-05-07T20:26:28.5549112Z #define M_E 2.7182818284590452354 2025-05-07T20:26:28.5549386Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:28.5549642Z #define __USE_ISOCXX11 1 2025-05-07T20:26:28.5549906Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:28.5550227Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:28.5550510Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:28.5550781Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:28.5551062Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:28.5551362Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:28.5551619Z #define __linux 1 2025-05-07T20:26:28.5551842Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:28.5552112Z #define cudaDeviceMask 0xff 2025-05-07T20:26:28.5552374Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:28.5552666Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:28.5552940Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:28.5553217Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:28.5553522Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:28.5553821Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:28.5554102Z #define _BITS_TYPES_H 1 2025-05-07T20:26:28.5554387Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:28.5554976Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:28.5555273Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:28.5555547Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:28.5555829Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:28.5556107Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:28.5556878Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:28.5557685Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:28.5558105Z #define __unix 1 2025-05-07T20:26:28.5558309Z #define MATH_ERRNO 1 2025-05-07T20:26:28.5558546Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:28.5558823Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:28.5559080Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:28.5559357Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:28.5559637Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5559913Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:28.5560365Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:28.5560821Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:28.5561112Z #define CUDARTAPI_CDECL 2025-05-07T20:26:28.5561358Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:28.5561625Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:28.5561904Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:28.5562156Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:28.5562401Z #define __SIZE_T 2025-05-07T20:26:28.5562646Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:28.5562955Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:28.5563244Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:28.5563500Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:28.5563898Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:28.5564281Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:28.5564713Z #define __WAIT_STATUS void * 2025-05-07T20:26:28.5564974Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:28.5565233Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:28.5565493Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:28.5565776Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:28.5566043Z #define __WINT_MIN__ 0U 2025-05-07T20:26:28.5566602Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:28.5567240Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:28.5567528Z #define WUNTRACED 2 2025-05-07T20:26:28.5567754Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:28.5568025Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:28.5568295Z #define NZERO 20 2025-05-07T20:26:28.5568520Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:28.5568795Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:28.5569083Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:28.5569367Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:28.5569616Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:28.5569901Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:28.5570163Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:28.5570435Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:28.5570707Z #define EXIT_FAILURE 1 2025-05-07T20:26:28.5570933Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:28.5571188Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:28.5571451Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:28.5571698Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:28.5571979Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:28.5572309Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:28.5572656Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:28.5572946Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:28.5573192Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:28.5573461Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:28.5573842Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:28.5574146Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:28.5574431Z #define SEEK_DATA 3 2025-05-07T20:26:28.5574651Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:28.5574939Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:28.5575353Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:28.5575731Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:28.5575976Z #define __INT64_C(c) c ## L 2025-05-07T20:26:28.5576241Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:28.5576644Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:28.5576966Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:28.5577237Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:28.5577524Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:28.5577819Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:28.5578077Z #define __INT_WCHAR_T_H 2025-05-07T20:26:28.5578311Z #define WSTOPPED 2 2025-05-07T20:26:28.5578542Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:28.5578832Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:28.5579081Z #define FP_NORMAL 4 
2025-05-07T20:26:28.5579312Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:28.5579594Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:28.5579830Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:28.5580075Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:28.5580359Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:28.5580628Z #define cudaTextureType1D 0x01 2025-05-07T20:26:28.5580917Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:28.5581179Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:28.5581436Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:28.5581730Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:28.5582161Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:28.5582599Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:28.5582860Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:28.5583126Z #define _POSIX_SOURCE 1 2025-05-07T20:26:28.5583373Z #define cudaTextureType2D 0x02 2025-05-07T20:26:28.5583633Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:28.5583899Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:28.5584203Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:28.5584463Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:28.5584780Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:28.5585114Z #define cudaTextureType3D 0x03 2025-05-07T20:26:28.5585375Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:28.5585640Z #define CLOCK_REALTIME 0 2025-05-07T20:26:28.5585889Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:28.5586155Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:28.5586455Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:28.5586728Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:28.5586994Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:28.5587274Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:28.5587545Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:28.5587836Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:28.5588128Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:28.5588404Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:28.5588642Z #define __GLIBC__ 2 2025-05-07T20:26:28.5588856Z #define __END_DECLS } 2025-05-07T20:26:28.5589090Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:28.5589450Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:28.5589815Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:28.5590067Z #define WCONTINUED 8 2025-05-07T20:26:28.5590296Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:28.5590540Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:28.5590809Z #define _ALLOCA_H 1 2025-05-07T20:26:28.5591036Z #define __host__ __location__(host) 2025-05-07T20:26:28.5591448Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:28.5591883Z #define __SLONG32_TYPE int 2025-05-07T20:26:28.5592241Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:28.5592514Z #define _SYS_SELECT_H 1 2025-05-07T20:26:28.5592752Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:28.5592996Z #define _IOS_NOCREATE 32 2025-05-07T20:26:28.5593234Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:28.5593512Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:28.5593803Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:28.5594085Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:28.5594364Z #define __global__ __location__(global) 2025-05-07T20:26:28.5594747Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:28.5595003Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:28.5595269Z #define __DBL_DIG__ 15 2025-05-07T20:26:28.5595494Z #define TIME_UTC 1 2025-05-07T20:26:28.5595710Z #define __FLT32_DIG__ 6 2025-05-07T20:26:28.5596031Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:28.5596467Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:28.5596780Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:28.5597087Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:28.5597382Z #define _G_BUFSIZ 8192 2025-05-07T20:26:28.5597680Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:28.5598045Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:28.5598335Z #define __cudaCDP2GetDevice 2025-05-07T20:26:28.5598614Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:28.5598904Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:28.5599141Z #define __GXX_WEAK__ 1 2025-05-07T20:26:28.5599401Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5599707Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:28.5599956Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:28.5600250Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:28.5600586Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:28.5600854Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:28.5601140Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:28.5601436Z #define _G_config_h 1 2025-05-07T20:26:28.5601705Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:28.5602037Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:28.5602310Z #define _GCC_WCHAR_T 2025-05-07T20:26:28.5602536Z #define TMP_MAX 238328 2025-05-07T20:26:28.5602764Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:28.5603028Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:28.5603287Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5603554Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:28.5603977Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:28.5604260Z #define _IO_SKIPWS 01 2025-05-07T20:26:28.5604655Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:28.5605115Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:28.5605376Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:28.5605699Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:28.5606061Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:28.5606437Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:28.5606797Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:28.5607039Z #define le32toh(x) (x) 2025-05-07T20:26:28.5607272Z #define _SIZE_T_DEFINED 2025-05-07T20:26:28.5607524Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:28.5607854Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:28.5608200Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:28.5608590Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:28.5609002Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:28.5609266Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:28.5609525Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:28.5609781Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:28.5610060Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:28.5610592Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:28.5611217Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:28.5611521Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:28.5611869Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:28.5612181Z #define _WCHAR_T_ 2025-05-07T20:26:28.5612401Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:28.5612760Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:28.5613142Z #define RTSIG_MAX 32 2025-05-07T20:26:28.5613355Z #define _STDDEF_H 2025-05-07T20:26:28.5613668Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:28.5613940Z #define _VA_LIST_DEFINED 2025-05-07T20:26:28.5614180Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:28.5614509Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:28.5614893Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:28.5615209Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:28.5615493Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:28.5615954Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:28.5616470Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:28.5616825Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:28.5617139Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:28.5617449Z #define __unix__ 1 2025-05-07T20:26:28.5617669Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5617944Z #define __INT_WIDTH__ 32 2025-05-07T20:26:28.5618182Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:28.5618416Z #define _IONBF 2 2025-05-07T20:26:28.5618856Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:28.5619609Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:28.5620135Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:28.5620383Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:28.5620646Z #define __UINT16_C(c) c 2025-05-07T20:26:28.5620884Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:28.5621142Z #define STA_DEL 0x0020 2025-05-07T20:26:28.5621378Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:28.5621628Z #define __id_t_defined 2025-05-07T20:26:28.5621886Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:28.5622330Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:28.5622751Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:28.5623019Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:28.5623270Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:28.5623521Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:28.5623777Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:28.5624035Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:28.5624297Z #define SING 2 2025-05-07T20:26:28.5624514Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:28.5624770Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5625077Z #define cudaStreamDefault 0x00 2025-05-07T20:26:28.5625425Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:28.5625787Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:28.5626059Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:28.5626324Z #define __gnu_linux__ 1 2025-05-07T20:26:28.5626555Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:28.5626810Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:28.5627058Z #define MAX_INPUT 255 2025-05-07T20:26:28.5627288Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:28.5627614Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:28.5627984Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:28.5628297Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:28.5628608Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:28.5629007Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:28.5629430Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:28.5629863Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:28.5630222Z #define _Mfloat_ float 2025-05-07T20:26:28.5630481Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:28.5630779Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:28.5631063Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:28.5631543Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:28.5632025Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5632372Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:28.5632693Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:28.5633043Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:28.5633326Z #define __USE_ISOC11 1 2025-05-07T20:26:28.5633553Z #define _BSD_SIZE_T_ 2025-05-07T20:26:28.5633782Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:28.5634023Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:28.5634287Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:28.5634578Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:28.5634887Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:28.5635193Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:28.5635519Z #define __THROW throw () 2025-05-07T20:26:28.5635761Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:28.5636040Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5636388Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:28.5636732Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:28.5637000Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:28.5637255Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:28.5637515Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:28.5637766Z #define L_tmpnam 20 2025-05-07T20:26:28.5637984Z #define ___int_wchar_t_h 2025-05-07T20:26:28.5638319Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:28.5639082Z #define isascii(c) __isascii (c) 2025-05-07T20:26:28.5647403Z #define _T_PTRDIFF 2025-05-07T20:26:28.5647726Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:28.5648075Z #define toascii(c) __toascii (c) 2025-05-07T20:26:28.5648329Z #define __GNUC__ 11 2025-05-07T20:26:28.5648586Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:28.5648883Z #define __GXX_RTTI 1 2025-05-07T20:26:28.5649099Z #define __pie__ 2 2025-05-07T20:26:28.5649310Z #define __MMX__ 1 2025-05-07T20:26:28.5649523Z #define __cudaCDP2Malloc 2025-05-07T20:26:28.5649788Z #define __timespec_defined 1 2025-05-07T20:26:28.5650036Z #define L_ctermid 9 2025-05-07T20:26:28.5650256Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:28.5650564Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:28.5650952Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:28.5651314Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:28.5651577Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:28.5651868Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:28.5652168Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:28.5652478Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:28.5652740Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:28.5653172Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:28.5653901Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:28.5654500Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:28.5654803Z #define __USE_SVID 1 2025-05-07T20:26:28.5655047Z #define __constant__ __location__(constant) 2025-05-07T20:26:28.5655356Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:28.5655649Z #define __device__ __location__(device) 2025-05-07T20:26:28.5655975Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:28.5656289Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:28.5656876Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:28.5657155Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:28.5657499Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:28.5657859Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:28.5658128Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:28.5658490Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:28.5658864Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:28.5659101Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:28.5659459Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:28.5660077Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:28.5660380Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:28.5660648Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:28.5660914Z #define NGROUPS_MAX 65536 2025-05-07T20:26:28.5661163Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:28.5661425Z #define __USE_ISOC95 1 2025-05-07T20:26:28.5661651Z #define _TIME_H 1 2025-05-07T20:26:28.5661920Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:28.5662243Z #define __USE_ISOC99 1 2025-05-07T20:26:28.5662565Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:28.5662931Z #define HOST_NAME_MAX 64 2025-05-07T20:26:28.5663173Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:28.5663432Z #define _IOS_ATEND 4 2025-05-07T20:26:28.5663667Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:28.5663985Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:28.5664396Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:28.5664749Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:28.5665029Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:28.5665358Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:28.5665674Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:28.5665924Z #define _STDIO_H 1 2025-05-07T20:26:28.5666374Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:28.5666844Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:28.5667203Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:28.5667568Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:28.5667854Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:28.5668121Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:28.5668377Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:28.5668667Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:28.5668973Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5669276Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:28.5669545Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:28.5669819Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:28.5670117Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:28.5670387Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:28.5670667Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:28.5671022Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:28.5671378Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:28.5671618Z #define __USE_XOPEN 1 2025-05-07T20:26:28.5671858Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:28.5672287Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:28.5672725Z #define __USE_XOPEN2K 1 2025-05-07T20:26:28.5672963Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:28.5673223Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:28.5673513Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:28.5673783Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:28.5674289Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:28.5674812Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:28.5675090Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:28.5675443Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:28.5675938Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5676364Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:28.5676758Z #define __END_NAMESPACE_C99 2025-05-07T20:26:28.5677019Z #define __glibcxx_integral_traps true 2025-05-07T20:26:28.5677306Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:28.5677558Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:28.5677805Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:28.5678072Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:28.5678322Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:28.5678694Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:28.5678993Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:28.5679362Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:28.5679740Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:28.5680009Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:28.5680270Z #define _IO_UNITBUF 020000 2025-05-07T20:26:28.5680522Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:28.5680779Z #define __FD_SETSIZE 1024 2025-05-07T20:26:28.5681025Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:28.5681295Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:28.5681632Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:28.5681985Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:28.5682250Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:28.5682549Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:28.5682868Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:28.5683147Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:28.5683447Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:28.5683932Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:28.5684213Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:28.5684535Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:28.5684810Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:28.5685081Z #define __USE_POSIX199506 1 2025-05-07T20:26:28.5685323Z #define _FEATURES_H 1 2025-05-07T20:26:28.5685556Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:28.5685940Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:28.5686347Z #define __stub_getmsg 2025-05-07T20:26:28.5686569Z #define _IO_FIXED 010000 2025-05-07T20:26:28.5686834Z #define __cpp_lib_addressof_constexpr 201603 
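The _ISbit helper dumped just above is how glibc's <ctype.h> packs each character class into one bit of the 16-bit classification word that __isctype_l (seen earlier in this dump) reads out of the __ctype_b table: classes 0-7 land in the high byte, classes 8-15 wrap into the low byte. A minimal sketch reusing the macro exactly as dumped; the loop bound of 12 is arbitrary, chosen only for illustration:

#include <cstdio>

// _ISbit copied verbatim from the dump: class numbers below 8 shift into the
// high byte, class numbers 8 and above wrap around into the low byte.
#define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8))

int main() {
    for (int bit = 0; bit < 12; ++bit)
        std::printf("_ISbit(%2d) = 0x%04x\n", bit, _ISbit(bit));
}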
2025-05-07T20:26:28.5687140Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:28.5687399Z #define __stub_setlogin 2025-05-07T20:26:28.5687638Z #define __stub_fattach 2025-05-07T20:26:28.5687878Z #define __cplusplus 201703L 2025-05-07T20:26:28.5688139Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:28.5688409Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:28.5688661Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:28.5688935Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:28.5689409Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:28.5689929Z #define _IO_INTERNAL 010 2025-05-07T20:26:28.5690177Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:28.5690500Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:28.5690847Z #define __dev_t_defined 2025-05-07T20:26:28.5691080Z #define __DEPRECATED 1 2025-05-07T20:26:28.5691297Z #define __S32_TYPE int 2025-05-07T20:26:28.5691541Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:28.5691830Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:28.5692078Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:28.5692331Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:28.5692929Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:28.5693560Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:28.5693859Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:28.5694196Z #define OVERFLOW 3 2025-05-07T20:26:28.5694441Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:28.5694743Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:28.5695241Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5695580Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:28.5695903Z #define __SSE2_MATH__ 1 2025-05-07T20:26:28.5696144Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:28.5696450Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5696743Z #define _IO_STDIO_H 2025-05-07T20:26:28.5696988Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:28.5697275Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:28.5697592Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:28.5697977Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5698281Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:28.5698540Z #define __amd64 1 2025-05-07T20:26:28.5698752Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:28.5699014Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:28.5699288Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:28.5699563Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:28.5699866Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:28.5700128Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:28.5700414Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:28.5700670Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:28.5700917Z #define __bounded 2025-05-07T20:26:28.5701143Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5701428Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:28.5701708Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:28.5701970Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:28.5702236Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5702555Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:28.5702968Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:28.5703360Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:28.5703629Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:28.5703963Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:28.5704296Z #define STA_PLL 0x0001 2025-05-07T20:26:28.5704539Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:28.5704808Z #define __GNUG__ 11 2025-05-07T20:26:28.5705032Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:28.5705296Z #define _T_WCHAR 2025-05-07T20:26:28.5705529Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:28.5705827Z #define __specialization_static 2025-05-07T20:26:28.5706127Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:28.5706428Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:28.5706689Z #define cudaArraySparse 0x40 2025-05-07T20:26:28.5706950Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:28.5707190Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:28.5707474Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:28.5707772Z #define _WCHAR_T 2025-05-07T20:26:28.5707983Z #define __cudaCDP2Free 2025-05-07T20:26:28.5708617Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:28.5709301Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:28.5709705Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:28.5710132Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:28.5710404Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:28.5710669Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:28.5710992Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:28.5711348Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:28.5711588Z #define __NO_CTYPE 1 2025-05-07T20:26:28.5711816Z #define __stub_bdflush 2025-05-07T20:26:28.5712169Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:28.5712580Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:28.5712875Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:28.5713135Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:28.5713407Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:28.5713797Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:28.5714088Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:28.5714421Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:28.5714759Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:28.5715031Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:28.5715307Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:28.5715641Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:28.5715977Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:28.5716331Z #define _IO_STDIO 040000 2025-05-07T20:26:28.5716648Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:28.5717026Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:28.5717330Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:28.5717613Z #define _PTRDIFF_T 2025-05-07T20:26:28.5717820Z #define _MOVE_H 1 2025-05-07T20:26:28.5718034Z #define __cpp_hex_float 201603L 2025-05-07T20:26:28.5718290Z #define ADJ_TAI 0x0080 2025-05-07T20:26:28.5718514Z #define __ptrvalue 2025-05-07T20:26:28.5718728Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:28.5718976Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:28.5719256Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:28.5719545Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:28.5719792Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:28.5720069Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:28.5720463Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:28.5720843Z #define __USE_GNU 1 2025-05-07T20:26:28.5721071Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:28.5721553Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:28.5721927Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:28.5722371Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:28.5722760Z #define WEXITED 4 2025-05-07T20:26:28.5722972Z #define _IO_NO_READS 4 2025-05-07T20:26:28.5723278Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:28.5723745Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:28.5724031Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:28.5724338Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:28.5724654Z #define __uid_t_defined 2025-05-07T20:26:28.5724897Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:28.5725185Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:28.5725463Z #define WNOHANG 1 2025-05-07T20:26:28.5725709Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:28.5726018Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:28.5726294Z #define cudaEventDefault 0x00 2025-05-07T20:26:28.5726597Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:28.5726913Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:28.5727159Z #define __x86_64 1 2025-05-07T20:26:28.5727394Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:28.5727784Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:28.5728264Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:28.5728759Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:28.5729190Z #define __PTRDIFF_T 2025-05-07T20:26:28.5729507Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:28.5729883Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:28.5730159Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5730446Z #define _Mlong_double_ long double 2025-05-07T20:26:28.5730730Z #define __cpp_lambdas 200907L 2025-05-07T20:26:28.5730989Z #define _IO_DEC 020 2025-05-07T20:26:28.5731209Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:28.5731477Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:28.5731765Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:28.5732039Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:28.5732298Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:28.5732706Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:28.5733026Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:28.5733298Z #define _ANSI_STDDEF_H 2025-05-07T20:26:28.5733571Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:28.5733888Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:28.5734247Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:28.5734639Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:28.5734923Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:28.5735309Z #define __cpp_template_auto 201606L 2025-05-07T20:26:28.5735670Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:28.5736046Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:28.5736308Z #define 
__key_t_defined 2025-05-07T20:26:28.5736562Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:28.5736933Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:28.5737405Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:28.5737778Z #define __GNUC_VA_LIST 2025-05-07T20:26:28.5738116Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:28.5738836Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:28.5739118Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:28.5739395Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:28.5739687Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:28.5739929Z #define __WCOREFLAG 0x80 2025-05-07T20:26:28.5740189Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:28.5740505Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:28.5740775Z #define __LP64__ 1 2025-05-07T20:26:28.5741018Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:28.5741332Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:28.5741604Z #define _IO_off64_t __off64_t 2025-05-07T20:26:28.5741860Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5742117Z #define __time_t_defined 1 2025-05-07T20:26:28.5742372Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:28.5742704Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:28.5743067Z #define __USE_UNIX98 1 2025-05-07T20:26:28.5743304Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5743565Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:28.5743828Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:28.5744124Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:28.5744424Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:28.5744678Z #define SEEK_CUR 1 2025-05-07T20:26:28.5744912Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5745168Z #define _ASSERT_H 1 2025-05-07T20:26:28.5745731Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:28.5746361Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:28.5746632Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:28.5746876Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:28.5747149Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:28.5747421Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:28.5747787Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:28.5748191Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:28.5748841Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:28.5749484Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:28.5749780Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:28.5750129Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:28.5750503Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:28.5750769Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:28.5751051Z #define cudaArrayDefault 0x00 2025-05-07T20:26:28.5751328Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:28.5751866Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:28.5752151Z #define TLOSS 5 2025-05-07T20:26:28.5752366Z #define __ssize_t_defined 2025-05-07T20:26:28.5752610Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:28.5752879Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:28.5753167Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:28.5753450Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:28.5753812Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:28.5754193Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:28.5754677Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:28.5754982Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:28.5755274Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:28.5755560Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:28.5755813Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:28.5756149Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:28.5756545Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:28.5756782Z #define __cdecl 2025-05-07T20:26:28.5757023Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:28.5757351Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:28.5757671Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:28.5757919Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:28.5758184Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:28.5758465Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:28.5758728Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:28.5759035Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:28.5759363Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:28.5759756Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:28.5760183Z #define ADJ_NANO 0x2000 2025-05-07T20:26:28.5760487Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:28.5760834Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:28.5761125Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:28.5761384Z #define __FLT_DIG__ 6 2025-05-07T20:26:28.5761726Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:28.5762124Z #define __NO_INLINE__ 1 2025-05-07T20:26:28.5762427Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:28.5762776Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:28.5763027Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:28.5763289Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:28.5763581Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:28.5764007Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:28.5764301Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:28.5764590Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:28.5764959Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:28.5765372Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 
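__CUDACC_VER_BUILD__ 85 here, together with __CUDACC_VER_MAJOR__ 12 and __CUDACC_VER_MINOR__ 6 elsewhere in this dump, pins the compiler to nvcc 12.6, matching the 12.6.3 entry in the job name. A sketch of the usual compile-time gate on these macros; the fallback definitions merely echo the values from this log so the snippet also builds with a plain host compiler, and the 12.4 threshold is an invented example:

#include <cstdio>

// Fallbacks carrying the values observed in this log, for non-nvcc builds.
#ifndef __CUDACC_VER_MAJOR__
#define __CUDACC_VER_MAJOR__ 12
#define __CUDACC_VER_MINOR__ 6
#define __CUDACC_VER_BUILD__ 85
#endif

// Collapse major/minor into one comparable number, using the same
// major*1000 + minor*10 encoding that CUDART_VERSION uses (12.6 -> 12060).
#define NVCC_VERSION_ENCODED (__CUDACC_VER_MAJOR__ * 1000 + __CUDACC_VER_MINOR__ * 10)
static_assert(NVCC_VERSION_ENCODED >= 12040, "this example assumes nvcc >= 12.4");

int main() {
    std::printf("nvcc %d.%d.%d (encoded %d)\n", __CUDACC_VER_MAJOR__,
                __CUDACC_VER_MINOR__, __CUDACC_VER_BUILD__, NVCC_VERSION_ENCODED);
}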
2025-05-07T20:26:28.5765717Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:28.5766057Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:28.5766301Z #define MAX_CANON 255 2025-05-07T20:26:28.5766556Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:28.5766808Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:28.5767073Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:28.5774655Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:28.5774974Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:28.5775268Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:28.5775553Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:28.5775892Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:28.5776210Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:28.5776468Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:28.5776754Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:28.5777050Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:28.5777331Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:28.5777635Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:28.5778094Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:28.5778359Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:28.5778611Z #define _SYS_TYPES_H 1 2025-05-07T20:26:28.5778855Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:28.5779121Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:28.5779366Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:28.5779602Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:28.5779880Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:28.5780174Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:28.5780419Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:28.5780817Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:28.5781090Z #define FP_SUBNORMAL 3 2025-05-07T20:26:28.5781335Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:28.5781616Z #define _INITIALIZER_LIST 2025-05-07T20:26:28.5781874Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:28.5782119Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:28.5782395Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:28.5782695Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:28.5782947Z #define _IO_file_flags _flags 2025-05-07T20:26:28.5783200Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:28.5783437Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:28.5783712Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:28.5783988Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:28.5784243Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:28.5784617Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:28.5785011Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:28.5785318Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:28.5785590Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:28.5785843Z #define _BSD_SOURCE 1 2025-05-07T20:26:28.5786078Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:28.5786925Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:28.5787770Z #define __catch(X) catch(X) 2025-05-07T20:26:28.5788028Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:28.5788309Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:28.5788581Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:28.5788832Z #define __STRING(x) #x 2025-05-07T20:26:28.5789068Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:28.5789337Z #define _T_PTRDIFF_
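__STRING(x) #x and __UINT64_C(c) c ## UL in the stretch above, along with __CONCAT(x,y) x ## y earlier in the dump, are the three classic preprocessor operators at work: stringize, suffix-paste, and token-paste. A small self-contained demonstration using renamed copies so it does not collide with the system headers:

#include <cstdint>
#include <cstdio>

#define MY_STRING(x) #x         // stringize: count1 -> "count1"
#define MY_CONCAT(x,y) x ## y   // token-paste: count, 1 -> count1
#define MY_UINT64_C(c) c ## UL  // paste a UL suffix onto a literal

int main() {
    int MY_CONCAT(count, 1) = 7;  // declares a variable named count1
    std::uint64_t big = MY_UINT64_C(0xffffffffffffffff);
    std::printf("%s = %d, big = %llx\n", MY_STRING(count1), count1,
                static_cast<unsigned long long>(big));
}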
2025-05-07T20:26:28.5789584Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:28.5789888Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:28.5790157Z #define __unbounded 2025-05-07T20:26:28.5790398Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5790688Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:28.5790960Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5791260Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:28.5791536Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:28.5791824Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:28.5792155Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:28.5792462Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:28.5792731Z #define __managed__ __location__(managed) 2025-05-07T20:26:28.5793027Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:28.5793420Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:28.5793838Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:28.5794088Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:28.5794456Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:28.5794865Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:28.5795106Z #define _SYS_SIZE_T_H 2025-05-07T20:26:28.5795393Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:28.5795727Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:28.5795998Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:28.5796328Z #define _CRTIMP 2025-05-07T20:26:28.5796667Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:28.5796930Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:28.5797230Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:28.5797553Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:28.5797896Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:28.5798302Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5798615Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:28.5798893Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:28.5799253Z #define __SIZE_T__ 2025-05-07T20:26:28.5799466Z #define __stub_gtty 2025-05-07T20:26:28.5799689Z #define __pid_t_defined 2025-05-07T20:26:28.5799940Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:28.5800231Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5800536Z #define __glibcxx_function_requires(...) 
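__GNUC_PREREQ just above is glibc's whole compiler-version scheme in one line: pack (major, minor) into a single integer with the major version in the high 16 bits, then compare. A sketch with a renamed copy; the 4.8 threshold is an arbitrary example, and this build would report true, since the dump shows __GNUC__ 11 and __VERSION__ "11.4.0":

#include <cstdio>

// Renamed copy of __GNUC_PREREQ from the dump.
#define MY_GNUC_PREREQ(maj, min) \
    ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))

int main() {
#if defined(__GNUC__)
    std::printf("GCC-compatible >= 4.8? %d\n", MY_GNUC_PREREQ(4, 8));
#else
    std::puts("__GNUC__ not defined");
#endif
}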
2025-05-07T20:26:28.5800817Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:28.5801059Z #define __need_clockid_t 2025-05-07T20:26:28.5801306Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:28.5801554Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:28.5801868Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:28.5802183Z #define _IO_HEX 0100 2025-05-07T20:26:28.5802431Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:28.5802761Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:28.5803069Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:28.5803343Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:28.5803906Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:28.5804352Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:28.5804664Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:28.5804949Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:28.5805061Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:28.5805163Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:28.5805243Z #define __stub_sstk 2025-05-07T20:26:28.5805346Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:28.5805499Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:28.5805579Z #define __wur 2025-05-07T20:26:28.5805701Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:28.5805786Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:28.5805877Z #define _IO_OCT 040 2025-05-07T20:26:28.5805971Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:28.5806058Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:28.5806155Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:28.5806284Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:28.5806378Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:28.5806485Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:28.5806669Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:28.5806762Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:28.5806856Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:28.5806962Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:28.5807054Z #define __off64_t_defined 2025-05-07T20:26:28.5807155Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:28.5807241Z #define __FLT128_DIG__ 33 2025-05-07T20:26:28.5807351Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:28.5807447Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:28.5807528Z #define __INT32_C(c) c 2025-05-07T20:26:28.5807628Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:28.5807722Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:28.5807814Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:28.5807909Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:28.5807999Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:28.5808098Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:28.5808233Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:28.5808326Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:28.5808416Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:28.5808513Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:28.5808604Z #define __have_pthread_attr_t 1 2025-05-07T20:26:28.5808806Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:28.5809027Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:28.5809132Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:28.5809242Z #define __cudaCDP2EventRecord 2025-05-07T20:26:28.5809335Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:28.5809418Z #define 
htole32(x) (x) 2025-05-07T20:26:28.5809671Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:28.5809789Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:28.5809966Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:28.5810125Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:28.5810262Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:28.5810392Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:28.5810527Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:28.5810617Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:28.5810728Z #define cudaArrayLayered 0x01 2025-05-07T20:26:28.5810893Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:28.5810999Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:28.5811100Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:28.5811198Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:28.5811276Z #define unix 1 2025-05-07T20:26:28.5811376Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:28.5811465Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:28.5811563Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:28.5811683Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:28.5811764Z #define __USE_POSIX 1 2025-05-07T20:26:28.5811861Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:28.5811990Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:28.5812076Z #define __THROWNL throw () 2025-05-07T20:26:28.5812173Z #define __cpp_rtti 199711L 2025-05-07T20:26:28.5812274Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:28.5812360Z #define __PMT(args) args 2025-05-07T20:26:28.5812481Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5812625Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:28.5812736Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:28.5812833Z #define _SIZE_T_DECLARED 2025-05-07T20:26:28.5812925Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:28.5813019Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:28.5813406Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:28.5813507Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:28.5813606Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:28.5813699Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:28.5813836Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:28.5813926Z #define _WCHAR_T_H 2025-05-07T20:26:28.5814014Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:28.5814100Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:28.5814195Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:28.5814291Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:28.5814393Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:28.5814478Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:28.5814584Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:28.5814670Z #define __ELF__ 1 2025-05-07T20:26:28.5814769Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:28.5814867Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:28.5814957Z #define STA_INS 0x0010 2025-05-07T20:26:28.5815054Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:28.5815228Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:28.5815327Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:28.5815419Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5815527Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
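__WIFEXITED(status) (__WTERMSIG(status) == 0) above, together with __WTERMSIG(status) ((status) & 0x7f) and __WCOREFLAG 0x80 earlier in the dump, spells out the classic wait-status layout: terminating signal in the low 7 bits (zero means a normal exit), core-dump flag in bit 7, exit code in bits 8-15. A sketch with renamed copies; the body of __WEXITSTATUS does not appear in the dump, so the bits 8-15 extraction below is an assumption based on glibc:

#include <cstdio>

// Renamed copies of the <sys/wait.h> bit layout shown in the dump.
#define MY_WTERMSIG(status)    ((status) & 0x7f)
#define MY_WIFEXITED(status)   (MY_WTERMSIG(status) == 0)
#define MY_WEXITSTATUS(status) (((status) & 0xff00) >> 8)  // assumed, per glibc

int main() {
    int normal = 3 << 8;  // process called exit(3): code in bits 8-15
    int killed = 9;       // process terminated by signal 9 (SIGKILL)
    std::printf("normal: exited=%d code=%d\n",
                MY_WIFEXITED(normal), MY_WEXITSTATUS(normal));
    std::printf("killed: exited=%d sig=%d\n",
                MY_WIFEXITED(killed), MY_WTERMSIG(killed));
}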
2025-05-07T20:26:28.5815641Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5815739Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:28.5815940Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:28.5816039Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:28.5816215Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:28.5816401Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:28.5816499Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:28.5816820Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:28.5816952Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:28.5817167Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:28.5817250Z #define __FLT_RADIX__ 2 2025-05-07T20:26:28.5817357Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:28.5817520Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:28.5817620Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:28.5817715Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:28.5817815Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:28.5817925Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:28.5818020Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:28.5818118Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:28.5818208Z #define WORD_BIT 32 2025-05-07T20:26:28.5818294Z #define _IO_USER_BUF 1 2025-05-07T20:26:28.5818384Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:28.5818490Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5818598Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:28.5818699Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:28.5818793Z #define __long_double_t long double 2025-05-07T20:26:28.5818889Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:28.5818986Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:28.5819386Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:28.5819468Z #define __k8 1 2025-05-07T20:26:28.5819664Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:28.5819835Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:28.5819949Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:28.5820052Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:28.5820146Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:28.5820249Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:28.5820341Z #define __blksize_t_defined 2025-05-07T20:26:28.5820432Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:28.5820534Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:28.5820646Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:28.5820742Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:28.5820851Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:28.5820943Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:28.5821034Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:28.5821293Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:28.5821632Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:28.5821738Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:28.5821833Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:28.5821914Z #define SEEK_SET 0 2025-05-07T20:26:28.5822018Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:28.5822111Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:28.5822300Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:28.5822409Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:28.5822510Z #define __cudaCDP2GetLastError 2025-05-07T20:26:28.5822607Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:28.5822700Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:28.5823014Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:28.5823117Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:28.5823211Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:28.5823299Z #define __stub_sigreturn 2025-05-07T20:26:28.5823631Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:28.5823727Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:28.5823816Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:28.5823917Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:28.5823999Z #define CLOCK_TAI 11 2025-05-07T20:26:28.5824102Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:28.5824193Z #define __restrict_arr 2025-05-07T20:26:28.5824302Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:28.5824518Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:28.5825039Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:28.5825221Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:28.5825311Z #define __USE_MISC 1 2025-05-07T20:26:28.5825420Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:28.5825517Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:28.5825609Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:28.5825711Z #define __LDBL_DIG__ 18 2025-05-07T20:26:28.5825808Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:28.5825907Z #define __malloc_and_calloc_defined 2025-05-07T20:26:28.5826005Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:28.5826108Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:28.5826209Z #define __x86_64__ 1 2025-05-07T20:26:28.5826302Z #define _SIZE_T_ 2025-05-07T20:26:28.5827188Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:28.5827293Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:28.5827391Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:28.5827504Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:28.5827626Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:28.5827718Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:28.5827826Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:28.5827952Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:28.5828088Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:28.5828183Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:28.5828650Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
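__bswap_constant_32 and __bswap_constant_64 above are the pure mask-and-shift byte reversals behind the htobe32/be32toh entries in this dump; on this little-endian host (le32toh(x) is the identity here) the big-endian conversions are exactly these swaps. A renamed copy of the 32-bit one, usable in constant expressions:

#include <cstdio>

// Renamed copy of __bswap_constant_32 from the dump.
#define MY_BSWAP32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | \
                       (((x) & 0x0000ff00) << 8)  | (((x) & 0x000000ff) << 24))

static_assert(MY_BSWAP32(0x11223344u) == 0x44332211u, "bytes reversed");

int main() {
    std::printf("htobe32(0x%08x) would be 0x%08x on this host\n",
                0xdeadbeefu, MY_BSWAP32(0xdeadbeefu));
}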
2025-05-07T20:26:28.5828773Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:28.5828923Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:28.5829020Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:28.5829117Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:28.5829209Z #define STA_FLL 0x0008 2025-05-07T20:26:28.5829350Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:28.5829443Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:28.5829567Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5829674Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:28.5829765Z #define __stub_revoke 2025-05-07T20:26:28.5829854Z #define __timer_t_defined 1 2025-05-07T20:26:28.5829985Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:28.5830089Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:28.5830194Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:28.5830297Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:28.5830401Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:28.5830501Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:28.5830606Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:28.5830709Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:28.5830939Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:28.5831033Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:28.5831124Z #define _IO_off_t __off_t 2025-05-07T20:26:28.5831208Z #define __FLT64_DIG__ 15 2025-05-07T20:26:28.5831429Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:28.5831523Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:28.5831647Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5831775Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:28.5831950Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:28.5832050Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:28.5832137Z #define NULL __null 2025-05-07T20:26:28.5832264Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:28.5832365Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:28.5832466Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:28.5832557Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5832662Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:28.5832744Z #define FP_ZERO 2 2025-05-07T20:26:28.5832837Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:28.5832993Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:28.5833098Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5833179Z #define __WCHAR_T__ 2025-05-07T20:26:28.5833276Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:28.5833467Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:28.5833620Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:28.5833720Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:28.5833838Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:28.5833951Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:28.5834083Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:28.5834208Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:28.5834305Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:28.5834400Z #define _SIGSET_H_types 1 2025-05-07T20:26:28.5834512Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:28.5834623Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:28.5834768Z #define __isdigit_l(c,l) 
__isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:28.5834865Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:28.5834988Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:28.5835113Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:28.5835216Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:28.5835353Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:28.5835523Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:28.5835624Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:28.5835726Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:28.5835820Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:28.5835911Z #define STA_MODE 0x4000 2025-05-07T20:26:28.5836022Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:28.5836144Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:28.5836274Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:28.5836384Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:28.5836478Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:28.5836591Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:28.5836684Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:28.5836798Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:28.5836886Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:28.5837005Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5837088Z #define __SEG_FS 1 2025-05-07T20:26:28.5837175Z #define _IO_size_t size_t 2025-05-07T20:26:28.5837269Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:28.5837370Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:28.5837454Z #define __stub_lchmod 2025-05-07T20:26:28.5837544Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:28.5837661Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5837839Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:28.5837924Z #define __SEG_GS 1 2025-05-07T20:26:28.5838108Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:28.5838196Z #define _IOS_APPEND 8 2025-05-07T20:26:28.5838296Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:28.5838800Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:28.5838957Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:28.5839093Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:28.5839455Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:28.5839539Z #define htole16(x) (x) 2025-05-07T20:26:28.5839655Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:28.5839749Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:28.5839839Z #define __INT16_TYPE__ short int 2025-05-07T20:26:28.5839947Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:28.5840057Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:28.5840169Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:28.5840301Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:28.5840388Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:28.5840484Z #define __WCLONE 0x80000000 2025-05-07T20:26:28.5840575Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:28.5840656Z #define SEEK_HOLE 4 2025-05-07T20:26:28.5840750Z #define TIMER_ABSTIME 1 2025-05-07T20:26:28.5840842Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:28.5840933Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:28.5841115Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:28.5841233Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5841330Z #define __DRIVER_FUNCTIONS_H__ 
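__glibcxx_signed_b(T,B) ((T)(-1) < 0) just above is the root of the __glibcxx_*_b family scattered through this dump (__glibcxx_digits_b, __glibcxx_max_b, __glibcxx_min_b): libstdc++ derives the integer bounds in <limits> from these four expressions. A sketch with renamed copies, checked against <climits>:

#include <climits>

// Renamed copies of the libstdc++ helpers from the dump.
#define SIGNED_B(T,B) ((T)(-1) < 0)
#define DIGITS_B(T,B) (B - SIGNED_B(T,B))
#define MAX_B(T,B) (SIGNED_B(T,B) \
    ? (((((T)1 << (DIGITS_B(T,B) - 1)) - 1) << 1) + 1) : ~(T)0)
#define MIN_B(T,B) (SIGNED_B(T,B) ? -MAX_B(T,B) - 1 : (T)0)

static_assert(MAX_B(int, 32) == INT_MAX, "signed max matches <climits>");
static_assert(MIN_B(int, 32) == INT_MIN, "signed min matches <climits>");
static_assert(MAX_B(unsigned, 32) == UINT_MAX, "unsigned max matches <climits>");

int main() { return 0; }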
2025-05-07T20:26:28.5841448Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:28.5841543Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5841670Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:28.5841761Z #define _LINUX_LIMITS_H 2025-05-07T20:26:28.5841842Z #define linux 1 2025-05-07T20:26:28.5841943Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:28.5842053Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:28.5842147Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:28.5842246Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:28.5842353Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:28.5842496Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:28.5842595Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:28.5842689Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5842787Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:28.5842881Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:28.5842961Z #define htole64(x) (x) 2025-05-07T20:26:28.5843064Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:28.5843186Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:28.5843279Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:28.5843886Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:28.5843975Z #define __USE_POSIX2 1 2025-05-07T20:26:28.5844071Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:28.5844163Z #define __WALL 0x40000000 2025-05-07T20:26:28.5844259Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:28.5844340Z #define _XLOCALE_H 1 2025-05-07T20:26:28.5844441Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:28.5844537Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:28.5844630Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5844742Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:28.5844833Z #define __EXCEPTIONS 1 2025-05-07T20:26:28.5844938Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:28.5845128Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:28.5845211Z #define __WORDSIZE 64 2025-05-07T20:26:28.5845312Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:28.5845398Z #define _STL_RELOPS_H 1 2025-05-07T20:26:28.5845489Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:28.5845762Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:28.5845861Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:28.5845955Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:28.5846059Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:28.5846358Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:28.5846590Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:28.5846713Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:28.5846888Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:28.5846996Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:28.5847106Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:28.5847205Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:28.5847315Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:28.5847493Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:28.5847590Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:28.5847692Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:28.5847794Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:28.5847969Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:28.5848081Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:28.5848164Z #define _STRING_H 1 2025-05-07T20:26:28.5848270Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:28.5848358Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:28.5848453Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:28.5848593Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:28.5848688Z #define __code_model_small__ 1 2025-05-07T20:26:28.5848776Z #define _PSTL_CONFIG_H 2025-05-07T20:26:28.5848881Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:28.5848993Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:28.5849090Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:28.5849197Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:28.5849536Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:28.5849638Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:28.5849721Z #define le64toh(x) (x) 2025-05-07T20:26:28.5849811Z #define FILENAME_MAX 4096 2025-05-07T20:26:28.5849966Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:28.5850078Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:28.5850158Z #define L_cuserid 9 2025-05-07T20:26:28.5850252Z #define __ino_t_defined 2025-05-07T20:26:28.5850338Z #define __k8__ 1 2025-05-07T20:26:28.5850432Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:28.5850544Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:28.5850632Z #define __int8_t_defined 2025-05-07T20:26:28.5850727Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:28.5850825Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5850936Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:28.5851041Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:28.5851128Z #define _IOS_TRUNC 16 2025-05-07T20:26:28.5851246Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:28.5851402Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:28.5851484Z #define __HAVE_COLUMN 2025-05-07T20:26:28.5851570Z #define __stub_fdetach 2025-05-07T20:26:28.5851980Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:28.5852066Z #define __pic__ 2 2025-05-07T20:26:28.5852195Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5852291Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:28.5852382Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:28.5852488Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:28.5852573Z #define __stub_chflags 2025-05-07T20:26:28.5852661Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:28.5852750Z #define __need_IOV_MAX 2025-05-07T20:26:28.5852856Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:28.5853047Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:28.5853155Z #define __cpp_decltype 200707L 2025-05-07T20:26:28.5853253Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:28.5853344Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:28.5853456Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:28.5853543Z #define TTY_NAME_MAX 32 2025-05-07T20:26:28.5853708Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:28.5853826Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5854071Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:28.5854188Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:28.5854282Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:28.5854372Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:28.5854460Z #define __import__ 2025-05-07T20:26:28.5854548Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:28.5854681Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:28.5854774Z #define __export__ 2025-05-07T20:26:28.5854889Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:28.5854989Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:28.5855159Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:28.5855254Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:28.5855345Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:28.5855440Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:28.5855528Z #define _WCHAR_T_DECLARED 
2025-05-07T20:26:28.5855650Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:28.5855775Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:28.5855877Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:28.5855971Z #define WNOWAIT 0x01000000 2025-05-07T20:26:28.5856050Z #define PLOSS 6 2025-05-07T20:26:28.5856149Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:28.5856461Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:28.5856555Z #define EXIT_SUCCESS 0 2025-05-07T20:26:28.5856654Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:28.5856744Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:28.5856841Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:28.5856936Z #define __thread__ __thread 2025-05-07T20:26:28.5857029Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:28.5857122Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:28.5857233Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:28.5857453Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:28.5857568Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:28.5857667Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:28.5857748Z #define __linux__ 1 2025-05-07T20:26:28.5857842Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:28.5857975Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:28.5858065Z #define __S16_TYPE short int 2025-05-07T20:26:28.5858427Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:28.5858533Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:28.5858719Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:28.5858821Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:28.5858916Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:28.5858997Z #define _T_SIZE_ 2025-05-07T20:26:28.5859106Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:28.5859224Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:28.5859318Z #define _PSTL_VERSION 12000 2025-05-07T20:26:28.5859443Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:28.5859536Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:28.5859637Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:28.5859765Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:28.5859849Z #define _IOS_INPUT 1 2025-05-07T20:26:28.5859948Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:28.5860139Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:28.5860230Z #define __INT64_TYPE__ long int 2025-05-07T20:26:28.5860336Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:28.5860435Z #define __shared__ __location__(shared) 2025-05-07T20:26:28.5860525Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:28.5860685Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:28.5860774Z #define __gid_t_defined 2025-05-07T20:26:28.5860896Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:28.5860992Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:28.5861264Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:28.5861365Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:28.5861456Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:28.5861540Z #define ___int_size_t_h 2025-05-07T20:26:28.5861650Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5861771Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:26:28.5861932Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:28.5862043Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:28.5862135Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:28.5862239Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:28.5862335Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:28.5862456Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5862574Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:28.5862691Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:28.5862779Z #define __clock_t_defined 1 2025-05-07T20:26:28.5862893Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:28.5863000Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:28.5863089Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:28.5863189Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:28.5863288Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:28.5863394Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:28.5863488Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:28.5863660Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:28.5863750Z #define __SSE__ 1 2025-05-07T20:26:28.5863844Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:28.5863938Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:28.5864026Z #define _CTYPE_H 1 2025-05-07T20:26:28.5864115Z #define __sigset_t_defined 2025-05-07T20:26:28.5864209Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:28.5864307Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:28.5864391Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:28.5864490Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:28.5864586Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:28.5864668Z #define __SM_70_RT_H__ 2025-05-07T20:26:28.5864758Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:28.5864866Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:28.5864957Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:28.5865123Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:28.5865220Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:28.5865326Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:28.5865425Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:28.5865514Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:28.5865596Z #define __amd64__ 1 2025-05-07T20:26:28.5865691Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:28.5865794Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:28.5866057Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:28.5866163Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:28.5866242Z #define EOF (-1) 2025-05-07T20:26:28.5866341Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:28.5866431Z #define __USE_POSIX199309 1 2025-05-07T20:26:28.5866525Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:28.5866623Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:28.5866718Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:28.5866813Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:28.5867015Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:28.5867107Z #define ____mbstate_t_defined 1 2025-05-07T20:26:28.5867194Z #define STA_NANO 0x2000 2025-05-07T20:26:28.5867296Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:28.5867387Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:28.5867470Z #define _IO_LINKED 0x80 2025-05-07T20:26:28.5867570Z #define __cpp_lib_launder 201606 2025-05-07T20:26:28.5867661Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:26:28.5867763Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:28.5867860Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:28.5868033Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:28.5868182Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:28.5868287Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5868385Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:28.5868484Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:28.5868576Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:28.5868663Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:28.5868808Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:28.5868925Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5869127Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:28.5869313Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:28.5869397Z #define __stub_stty 2025-05-07T20:26:28.5869571Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:28.5869656Z #define le16toh(x) (x) 2025-05-07T20:26:28.5869767Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:28.5869945Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:28.5870025Z #define _SIZET_ 2025-05-07T20:26:28.5870113Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:28.5870201Z #define _SVID_SOURCE 1 2025-05-07T20:26:28.5870279Z #define _LP64 1 2025-05-07T20:26:28.5870365Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:28.5870619Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:28.5870728Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:28.5870820Z #define __UINT8_C(c) c 2025-05-07T20:26:28.5870912Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:28.5871003Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:28.5871119Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:28.5883903Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:28.5884046Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:28.5884148Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:28.5884265Z #define CUDARTAPI 2025-05-07T20:26:28.5884350Z #define IOV_MAX 1024 2025-05-07T20:26:28.5884504Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:28.5884613Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:28.5884718Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:28.5884803Z #define __wchar_t__ 2025-05-07T20:26:28.5884916Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:28.5884999Z #define SEEK_END 2 2025-05-07T20:26:28.5885097Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:28.5885280Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:28.5885381Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:28.5885531Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:28.5885622Z #define ____FILE_defined 1 2025-05-07T20:26:28.5885740Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:28.5885846Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:28.5885937Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:28.5886045Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:28.5886345Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:28.5886474Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:28.5886558Z #define _IO_RIGHT 04 2025-05-07T20:26:28.5886661Z #define __END_NAMESPACE_STD 2025-05-07T20:26:28.5886848Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:28.5887092Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:28.5887214Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:28.5887310Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:28.5887416Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:28.5887498Z #define _STDDEF_H_ 2025-05-07T20:26:28.5887670Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:28.5887774Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5887892Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:28.5888182Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:28.5888300Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5888443Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:28.5888571Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:28.5888675Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:28.5888787Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:28.5888899Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5889013Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:28.5889108Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:28.5889211Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:28.5889307Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:28.5889487Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:28.5889582Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:28.5889760Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:28.5889870Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:28.5889963Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:28.5890104Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:28.5890204Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:28.5890296Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:28.5890396Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:28.5890495Z #define P_tmpdir "/tmp" 2025-05-07T20:26:28.5890618Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:28.5890710Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:28.5890819Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:28.5890981Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:28.5891156Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:28.5891255Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:28.5891378Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:28.5891494Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:28.5891602Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:28.5891829Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:28.5891933Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:28.5892044Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:28.5892137Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:28.5892232Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:28.5892324Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:28.5892421Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:28.5892523Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:28.5892602Z #define __FXSR__ 1 2025-05-07T20:26:28.5892688Z #define _SIZE_T 2025-05-07T20:26:28.5892789Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:28.5892900Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:28.5893072Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:28.5893218Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:28.5893315Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:28.5893418Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:28.5893599Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:28.5893798Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:28.5893893Z #define _GXX_NULLPTR_T 2025-05-07T20:26:28.5894016Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:28.5894233Z #define FOPEN_MAX 16 2025-05-07T20:26:28.5894327Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:28.5894444Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:28.5894548Z #define __suseconds_t_defined 2025-05-07T20:26:28.5894635Z #define __off_t_defined 2025-05-07T20:26:28.5894719Z #define stderr stderr 2025-05-07T20:26:28.5894821Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:28.5894930Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:28.5895027Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:28.5895205Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:28.5895614Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:28.5895714Z #define __mode_t_defined 2025-05-07T20:26:28.5895801Z #define _GCC_SIZE_T 2025-05-07T20:26:28.5895903Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5896016Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:28.5896129Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:28.5896223Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:28.5896321Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:28.5896424Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:28.5896531Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:28.5896641Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:28.5896732Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:28.5896813Z #define __size_t__ 2025-05-07T20:26:28.5896948Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:28.5897047Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:28.5897167Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:28.5897317Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:28.5897411Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:28.5897590Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:28.5897679Z #define _ENDIAN_H 1 2025-05-07T20:26:28.5897784Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:28.5897892Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:28.5897996Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:28.5898076Z #define __try try 2025-05-07T20:26:28.5898179Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:28.5898276Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:28.5898374Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:28.5898633Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:28.5898722Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:28.5898808Z #define __PIC__ 2 2025-05-07T20:26:28.5898925Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:28.5899046Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:28.5899182Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:28.5899280Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:28.5899376Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:28.5899568Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:28.5899671Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5899769Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:28.5899867Z #define _IO_uid_t __uid_t 2025-05-07T20:26:28.5899962Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:28.5900101Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:28.5900193Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:28.5900336Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:28.5900443Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:28.5900563Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:28.5900652Z #define LONG_BIT 64 2025-05-07T20:26:28.5900764Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:28.5900863Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:28.5900987Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:28.5901085Z #define __fsfilcnt_t_defined 2025-05-07T20:26:28.5901175Z #define __blkcnt_t_defined 2025-05-07T20:26:28.5901539Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:28.5901631Z #define __USE_LARGEFILE 1 2025-05-07T20:26:28.5901727Z #define __cpp_constexpr 201603L 2025-05-07T20:26:28.5901830Z #define CUDART_VERSION 12060 2025-05-07T20:26:28.5901920Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:28.5902021Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:28.5902117Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:28.5902313Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:28.5902406Z #define __lldiv_t_defined 1 2025-05-07T20:26:28.5902567Z #define __SSE2__ 1 2025-05-07T20:26:28.5902650Z #define _IOLBF 1 2025-05-07T20:26:28.5902749Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:28.5902852Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:28.5902956Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:28.5903060Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:28.5903169Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:28.5903256Z #define __INT32_TYPE__ int 2025-05-07T20:26:28.5903360Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:28.5903464Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:28.5903563Z #define __cpp_exceptions 199711L 2025-05-07T20:26:28.5903663Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:28.5903772Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:28.5903861Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:28.5903979Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:28.5904138Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:28.5904245Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:28.5904341Z #define __SWORD_TYPE long int 2025-05-07T20:26:28.5904439Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:28.5904540Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:28.5904633Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:28.5904724Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:28.5905010Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:28.5905108Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:28.5905253Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:28.5905340Z #define _T_SIZE 2025-05-07T20:26:28.5905447Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:28.5905572Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:26:28.5905699Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:28.5905790Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:28.5905885Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:28.5906005Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:28.5906102Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:28.5906206Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5906298Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:28.5906474Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:28.5906573Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:28.5906672Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:28.5906769Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:28.5906891Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5906974Z #define __PIE__ 2 2025-05-07T20:26:28.5907079Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:28.5907182Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:28.5907373Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:28.5907599Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:28.5907692Z #define __nlink_t_defined 2025-05-07T20:26:28.5907823Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:28.5907945Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:28.5908034Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:28.5908295Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:28.5908418Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:28.5908519Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:28.5908755Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:28.5908852Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:28.5908941Z #define __FILE_defined 1 2025-05-07T20:26:28.5909126Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:28.5909222Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:28.5909318Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:28.5909434Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:28.5909547Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:28.5909735Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:28.5909845Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:28.5909933Z #define __INT16_C(c) c 2025-05-07T20:26:28.5910032Z #define __U32_TYPE unsigned int 2025-05-07T20:26:28.5910140Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:28.5910262Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:28.5910348Z #define __STDC__ 1 2025-05-07T20:26:28.5910451Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:28.5910551Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:28.5910655Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:28.5910807Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:28.5910895Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:28.5910999Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:28.5911097Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:28.5911208Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:28.5911322Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:28.5911425Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:28.5911534Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:28.5911617Z #define stdin stdin 2025-05-07T20:26:28.5911707Z #define __ino64_t_defined 
2025-05-07T20:26:28.5911797Z #define STA_CLK 0x8000 2025-05-07T20:26:28.5911889Z #define __clockid_t_defined 1 2025-05-07T20:26:28.5912033Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:28.5912208Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:28.5912309Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:28.5912411Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:28.5912518Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:28.5912625Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:28.5912821Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:28.5912917Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:28.5913452Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:28.5913546Z #define DOMAIN 1 2025-05-07T20:26:28.5913639Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:28.5913723Z #define __NVCC__ 1 2025-05-07T20:26:28.5913838Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:28.5913959Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5914062Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:28.5914172Z #define __throw_exception_again throw 2025-05-07T20:26:28.5914264Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:28.5914359Z #define __EXCEPTION_H 1 2025-05-07T20:26:28.5914454Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:28.5914557Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:28.5914865Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:28.5914978Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:28.5915077Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:28.5915183Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:28.5915284Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:28.5915381Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:28.5915530Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:28.5915636Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5915838Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:28.5915934Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:28.5916039Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:28.5916164Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:28.5916277Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:28.5916423Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:28.5916527Z #define __useconds_t_defined 2025-05-07T20:26:28.5916625Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:28.5916884Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:28.5917040Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:28.5917126Z #define __SSE_MATH__ 1 2025-05-07T20:26:28.5917213Z #define _IO_wint_t wint_t 2025-05-07T20:26:28.5917314Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:28.5917403Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:28.5917504Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:28.5917624Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:28.5917721Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:28.5917819Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:28.5917904Z #define __USE_ATFILE 1 2025-05-07T20:26:28.5917994Z #define _POSIX_OPEN_MAX 
20 2025-05-07T20:26:28.5918094Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:28.5918183Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:28.5918408Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:28.5918511Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:28.5918621Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:28.5918730Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:28.5918839Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:28.5918922Z #define _STDLIB_H 1 2025-05-07T20:26:28.5919066Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:28.5919161Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5919255Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:28.5919397Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5919504Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:28.5919599Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:28.5919789Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:28.5919944Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:28.5920058Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:28.5920172Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:28.5920263Z #define __ldiv_t_defined 1 2025-05-07T20:26:28.5920454Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:28.5920546Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:28.5920713Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:28.5920822Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:28.5920913Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:28.5921012Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:28.5921123Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5921220Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:28.5921301Z #define CUDART_CB 2025-05-07T20:26:28.5921409Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:28.5921530Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:28.5921625Z #define MB_LEN_MAX 16 2025-05-07T20:26:28.5921847Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:28.5921947Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:28.5922079Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:28.5922196Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:28.5922290Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:28.5922446Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:28.5922551Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:28.5922635Z #define _GNU_SOURCE 1 2025-05-07T20:26:28.5922726Z #define __stub_putmsg 2025-05-07T20:26:28.5922808Z #define __CUDACC__ 1 2025-05-07T20:26:28.5922989Z #define __N(msgid) (msgid) 2025-05-07T20:26:28.5923078Z #define __P(args) args 2025-05-07T20:26:28.5923330Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:28.5923438Z #define __cpp_init_captures 201304L 2025-05-07T20:26:28.5923542Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:28.5923770Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:28.5923874Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:28.5923955Z #define __WCHAR_T 2025-05-07T20:26:28.5924127Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:28.5924227Z #define __fsblkcnt_t_defined 2025-05-07T20:26:28.5924343Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:26:28.5924442Z #define 
__DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:26:28.5924456Z 2025-05-07T20:26:28.6209762Z 2025-05-07T20:26:28.6210607Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:28.6210626Z 2025-05-07T20:26:30.5355558Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:30.5355962Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:26:30.5356275Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:26:30.5356584Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:26:30.5356918Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:26:30.5357128Z 2025-05-07T20:26:30.6012413Z 2025-05-07T20:26:30.6022863Z /usr/bin/nvidia-smi 2025-05-07T20:26:30.6028328Z + nvidia-smi 2025-05-07T20:26:30.6028488Z 2025-05-07T20:26:30.6209461Z Wed May 7 20:26:30 2025 2025-05-07T20:26:30.6209998Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:30.6210604Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:26:30.6211098Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:30.6211599Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:26:30.6212135Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:26:30.6212570Z | | | MIG M. | 2025-05-07T20:26:30.6212911Z |=========================================+========================+======================| 2025-05-07T20:26:30.6380879Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:26:30.6381512Z | 0% 27C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:26:30.6382249Z | | | N/A | 2025-05-07T20:26:30.6382763Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:30.6385777Z 2025-05-07T20:26:30.6386593Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:30.6387257Z | Processes: | 2025-05-07T20:26:30.6387820Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:26:30.6388332Z | ID ID Usage | 2025-05-07T20:26:30.6388793Z |=========================================================================================| 2025-05-07T20:26:30.6392262Z | No running processes found | 2025-05-07T20:26:30.6393129Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:30.8780251Z 2025-05-07T20:26:30.8785627Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:26:30.8840143Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:26:30.8840757Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:26:30.8853641Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:26:30.8854089Z env: 2025-05-07T20:26:30.8854402Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:26:30.8854864Z BUILD_ENV: build_binary 2025-05-07T20:26:30.8855209Z BUILD_TARGET: genai 2025-05-07T20:26:30.8855592Z BUILD_VARIANT: cuda 2025-05-07T20:26:30.8856016Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:26:30.8856330Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:26:30.8856714Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:26:30.8857424Z ##[endgroup] 2025-05-07T20:26:31.2268715Z ################################################################################ 2025-05-07T20:26:31.2269173Z # Install PyTorch (PIP) 2025-05-07T20:26:31.2269640Z # 2025-05-07T20:26:31.2285563Z # [2025-05-07T20:26:31.228Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:26:31.2286155Z ################################################################################ 2025-05-07T20:26:31.2286421Z 2025-05-07T20:26:31.2314031Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:26:32.2290843Z Channels: 2025-05-07T20:26:32.2291186Z - conda-forge 2025-05-07T20:26:32.2291646Z Platform: linux-64 2025-05-07T20:26:35.6013320Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:26:36.3440313Z Solving environment: \ | / done 2025-05-07T20:26:36.5604767Z 2025-05-07T20:26:36.5605250Z ## Package Plan ## 2025-05-07T20:26:36.5605674Z 2025-05-07T20:26:36.5605937Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:36.5606345Z 2025-05-07T20:26:36.5606473Z added / updated specs: 2025-05-07T20:26:36.5606865Z - numpy 2025-05-07T20:26:36.5607066Z 2025-05-07T20:26:36.5607084Z 2025-05-07T20:26:36.5607309Z The following packages will be downloaded: 2025-05-07T20:26:36.5607555Z 2025-05-07T20:26:36.5607703Z package | build 2025-05-07T20:26:36.5608201Z ---------------------------|----------------- 2025-05-07T20:26:36.5608974Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:26:36.5609489Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:26:36.5610015Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:26:36.5610652Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:26:36.5611205Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:26:36.5611716Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:26:36.5612349Z numpy-2.2.5 | py311h5d046bc_0 8.6 MB conda-forge 2025-05-07T20:26:36.5612828Z ------------------------------------------------------------ 2025-05-07T20:26:36.5613502Z Total: 15.9 MB 2025-05-07T20:26:36.5613793Z 2025-05-07T20:26:36.5613953Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:36.5614242Z 2025-05-07T20:26:36.5614487Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:26:36.5615136Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:26:36.5615743Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:26:36.5616406Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:26:36.5617142Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:26:36.5617790Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:26:36.5618697Z numpy 
conda-forge/linux-64::numpy-2.2.5-py311h5d046bc_0
2025-05-07T20:26:36.5619226Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:26:37.6360511Z Preparing transaction: done
2025-05-07T20:26:37.8367706Z Verifying transaction: done
2025-05-07T20:26:37.9374697Z Executing transaction: done
2025-05-07T20:26:38.1290167Z ################################################################################
2025-05-07T20:26:38.1290552Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:38.1290850Z #
2025-05-07T20:26:38.1305622Z # [2025-05-07T20:26:38.130Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:38.1306096Z ################################################################################
2025-05-07T20:26:38.1306321Z
2025-05-07T20:26:38.1321116Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:38.2238924Z [CHECK] Network does not appear to be blocked.
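[Editor's note] The "[EXEC] [ATTEMPT 0/3]" prefix above comes from a bounded-retry wrapper defined in .github/scripts/setup_env.bash; commands such as the conda install and the wget network probe are retried up to three times. A minimal sketch of that pattern follows; the helper name exec_with_retries, the backoff, and the limits here are illustrative assumptions, not the repo's exact implementation.

    # Hedged sketch of a bounded-retry runner like the one producing the
    # "[EXEC] [ATTEMPT i/3]" lines above; names and limits are assumptions.
    exec_with_retries () {
      local max_attempts=3
      local attempt
      for attempt in $(seq 0 $((max_attempts - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0
        fi
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Example usage, mirroring the network probe in this log:
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null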
2025-05-07T20:26:38.2239491Z ################################################################################ 2025-05-07T20:26:38.2239877Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:38.2240161Z # 2025-05-07T20:26:38.2260087Z # [2025-05-07T20:26:38.225Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:26:38.2260893Z ################################################################################ 2025-05-07T20:26:38.2261122Z 2025-05-07T20:26:38.2285819Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:38.2311442Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:26:38.2328158Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:38.2328937Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:26:38.2337667Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:38.2347339Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:26:38.2369526Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:56.6986810Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:56.6987290Z Collecting torch 2025-05-07T20:27:56.6987983Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:27:56.6988959Z Collecting filelock (from torch) 2025-05-07T20:27:56.6989595Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:27:56.6991020Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from torch) (4.13.2) 2025-05-07T20:27:56.6991850Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:27:56.6992586Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:27:56.6993471Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 36.5 MB/s eta 0:00:00 2025-05-07T20:27:56.6993844Z Collecting networkx (from torch) 2025-05-07T20:27:56.6994336Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:27:56.6994991Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.8 MB/s eta 0:00:00 2025-05-07T20:27:56.6995340Z Collecting jinja2 (from torch) 2025-05-07T20:27:56.6995814Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:27:56.6996310Z Collecting fsspec (from torch) 2025-05-07T20:27:56.6996807Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:27:56.6997381Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:27:56.6998096Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:27:56.6998876Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 73.2 MB/s eta 0:00:00 2025-05-07T20:27:56.6999299Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:27:56.7000036Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:27:56.7000810Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 9.8 
MB/s eta 0:00:00 2025-05-07T20:27:56.7001206Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:27:56.7001909Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:27:56.7002680Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 40.6 MB/s eta 0:00:00 2025-05-07T20:27:56.7003065Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:27:56.7003740Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:27:56.7004733Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 36.4 MB/s eta 0:00:00 2025-05-07T20:27:56.7005136Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:27:56.7006365Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:27:56.7007242Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 66.1 MB/s eta 0:00:00 2025-05-07T20:27:56.7007621Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:27:56.7008292Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:27:56.7009052Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 153.9 MB/s eta 0:00:00 2025-05-07T20:27:56.7009549Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:27:56.7010276Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:27:56.7011288Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 208.6 MB/s eta 0:00:00 2025-05-07T20:27:56.7011693Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:27:56.7012402Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:27:56.7013177Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 148.5 MB/s eta 0:00:00 2025-05-07T20:27:56.7013566Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:27:56.7014263Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:27:56.7015043Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 144.5 MB/s eta 0:00:00 2025-05-07T20:27:56.7015429Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:27:56.7016126Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:27:56.7016912Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 161.9 MB/s eta 0:00:00 2025-05-07T20:27:56.7017292Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:27:56.7018160Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:27:56.7019003Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:27:56.7019657Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:27:56.7020326Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:27:56.7021095Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:27:56.7021967Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 144.5 MB/s eta 0:00:00 2025-05-07T20:27:56.7022358Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:27:56.7023149Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:27:56.7023957Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:27:56.7024789Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:27:56.7026093Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T20:27:56.7027088Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:27:56.7027638Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:27:56.7028374Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 56.7 MB/s eta 0:00:00 2025-05-07T20:27:56.7028874Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:27:56.7029768Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:27:56.7030831Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl (825.6 MB) 2025-05-07T20:27:56.7031655Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.6/825.6 MB 36.6 MB/s eta 0:00:00 2025-05-07T20:27:56.7032421Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:27:56.7033257Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 21.0 MB/s eta 0:00:00 2025-05-07T20:27:56.7034003Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:27:56.7035085Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 103.0 MB/s eta 0:00:00 2025-05-07T20:27:56.7036054Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:27:56.7037105Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 134.1 MB/s eta 0:00:00 2025-05-07T20:27:56.7039431Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:27:56.7041057Z 2025-05-07T20:27:56.7043060Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 
nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:27:56.7045227Z 2025-05-07T20:27:58.9324440Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:27:58.9327344Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:02.3932401Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:05.8624040Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:05.8624518Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:09.2564083Z True 2025-05-07T20:28:09.2564337Z True 2025-05-07T20:28:09.2564443Z 2025-05-07T20:28:09.3213109Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:09.3256212Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:09.3256827Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:09.3270474Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:09.3270863Z env: 2025-05-07T20:28:09.3271099Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:09.3271425Z BUILD_ENV: build_binary 2025-05-07T20:28:09.3271668Z BUILD_TARGET: genai 2025-05-07T20:28:09.3271894Z BUILD_VARIANT: cuda 2025-05-07T20:28:09.3272130Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:09.3272378Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:09.3272679Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:09.3273202Z ##[endgroup] 2025-05-07T20:28:09.6639500Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:09.6641580Z ################################################################################ 2025-05-07T20:28:09.6642202Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:09.6642568Z # 2025-05-07T20:28:09.6657372Z # [2025-05-07T20:28:09.665Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:09.6657813Z ################################################################################ 2025-05-07T20:28:09.6658030Z 2025-05-07T20:28:09.6672770Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:09.7738643Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:09.7749386Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:09.7750060Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:09.7750459Z 2025-05-07T20:28:09.8618394Z 2025-05-07T20:28:09.8619121Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:09.8642289Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:15.8144150Z Collecting environment information... 
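[NOTE] Editorial sketch, not output from this run: the environment report that follows is produced by PyTorch's own collect_env.py. To gather the same report against this conda env by hand, the two commands already shown above suffice:

    # fetch the collector script from pytorch main and run it inside the build env
    wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    conda run -n build_binary python collect_env.py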
2025-05-07T20:28:15.8144756Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:15.8145238Z Is debug build: False 2025-05-07T20:28:15.8145499Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:15.8145778Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:15.8145964Z 2025-05-07T20:28:15.8146073Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:15.8146403Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:15.8146809Z Clang version: Could not collect 2025-05-07T20:28:15.8147189Z CMake version: Could not collect 2025-05-07T20:28:15.8147752Z Libc version: glibc-2.34 2025-05-07T20:28:15.8147979Z 2025-05-07T20:28:15.8148409Z Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:15.8149162Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:15.8149584Z Is CUDA available: True 2025-05-07T20:28:15.8149848Z CUDA runtime version: 12.6.85 2025-05-07T20:28:15.8150121Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:15.8150449Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:15.8150781Z Nvidia driver version: 570.133.07 2025-05-07T20:28:15.8151213Z cuDNN version: Could not collect 2025-05-07T20:28:15.8151484Z HIP runtime version: N/A 2025-05-07T20:28:15.8151746Z MIOpen runtime version: N/A 2025-05-07T20:28:15.8152016Z Is XNNPACK available: True 2025-05-07T20:28:15.8152175Z 2025-05-07T20:28:15.8152253Z CPU: 2025-05-07T20:28:15.8152471Z Architecture: x86_64 2025-05-07T20:28:15.8152812Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:15.8153321Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:15.8153714Z Byte Order: Little Endian 2025-05-07T20:28:15.8154038Z CPU(s): 16 2025-05-07T20:28:15.8154327Z On-line CPU(s) list: 0-15 2025-05-07T20:28:15.8155025Z Vendor ID: AuthenticAMD 2025-05-07T20:28:15.8155376Z Model name: AMD EPYC 7R32 2025-05-07T20:28:15.8155701Z CPU family: 23 2025-05-07T20:28:15.8155979Z Model: 49 2025-05-07T20:28:15.8156415Z Thread(s) per core: 2 2025-05-07T20:28:15.8156937Z Core(s) per socket: 8 2025-05-07T20:28:15.8157298Z Socket(s): 1 2025-05-07T20:28:15.8157577Z Stepping: 0 2025-05-07T20:28:15.8157873Z BogoMIPS: 5599.99 2025-05-07T20:28:15.8160120Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:15.8162379Z Hypervisor vendor: KVM 2025-05-07T20:28:15.8162683Z Virtualization type: full 2025-05-07T20:28:15.8163023Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:15.8163388Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:15.8164036Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:15.8164390Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:15.8164709Z NUMA node(s): 1 2025-05-07T20:28:15.8165017Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:15.8165346Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:15.8165835Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:15.8166196Z Vulnerability L1tf: Not affected 2025-05-07T20:28:15.8166555Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:15.8166900Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:15.8167408Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:15.8167778Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:15.8168321Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:15.8169312Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:15.8169930Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:15.8170622Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:15.8171476Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:15.8172157Z Vulnerability Srbds: Not affected 2025-05-07T20:28:15.8172509Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:15.8172746Z 2025-05-07T20:28:15.8172849Z Versions of relevant libraries: 2025-05-07T20:28:15.8173117Z [pip3] numpy==2.2.5 2025-05-07T20:28:15.8173364Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:15.8173672Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:15.8173984Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:15.8174422Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:15.8174823Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:15.8175114Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:15.8175412Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:15.8175710Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:15.8176141Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:15.8176602Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:15.8176897Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:15.8177181Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:15.8177478Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:15.8177760Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:15.8178082Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:15.8178456Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8178938Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8179575Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:15.8180100Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8180783Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:15.8181314Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:15.8181799Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8182393Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:15.8182876Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:15.8183369Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:15.8183848Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8184440Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:15.8184905Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8185354Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8185835Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8186314Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:15.8186763Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:15.8187370Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:15.8187836Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8188298Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:15.8188756Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8189220Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8189811Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:15.8190304Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:15.8190787Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8191390Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:15.8191879Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8192360Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:15.8192825Z [conda] numpy 2.2.5 py311h5d046bc_0 conda-forge 2025-05-07T20:28:15.8193288Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:15.8193907Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:15.8194406Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:15.8194913Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:15.8195520Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:15.8196090Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:15.8196566Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:15.8197056Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:15.8197549Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:15.8198043Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:15.8198529Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:15.8199012Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:15.8199487Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:15.8200222Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:15.8200687Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:15.8200957Z 2025-05-07T20:28:15.8942407Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:15.8943075Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:15.8955737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:15.8956088Z env: 2025-05-07T20:28:15.8956315Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:15.8956607Z BUILD_ENV: build_binary 2025-05-07T20:28:15.8956857Z BUILD_TARGET: genai 2025-05-07T20:28:15.8957092Z BUILD_VARIANT: cuda 2025-05-07T20:28:15.8957338Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:15.8957585Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:15.8957894Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:15.8958228Z ##[endgroup] 2025-05-07T20:28:16.2397725Z ################################################################################ 2025-05-07T20:28:16.2398082Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:16.2398326Z # 2025-05-07T20:28:16.2415132Z # [2025-05-07T20:28:16.241Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:16.2415552Z ################################################################################ 2025-05-07T20:28:16.2415771Z 2025-05-07T20:28:16.2430789Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:16.3388290Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:16.3412330Z [BUILD] Running git submodules update ... 2025-05-07T20:28:16.3436057Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:16.3802553Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:16.3803474Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:16.3804511Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:16.3805304Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:16.3806100Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:16.3806942Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:16.3807733Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:16.3836293Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:16.4390703Z [BUILD] Installing other build dependencies ... 
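[NOTE] Editorial sketch: the [EXEC] [ATTEMPT n/3] lines throughout this log come from a retry wrapper in setup_env.bash. The helper's real name and details are not visible here; a minimal bash equivalent of the observed behavior looks like:

    # hypothetical retry wrapper; the actual helper in setup_env.bash may differ
    run_with_retries () {
      local max=3 attempt
      for (( attempt = 0; attempt <= max; attempt++ )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep 5
      done
      echo "[EXEC] command failed after ${max} attempts: $*" >&2
      return 1
    }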
2025-05-07T20:28:16.4413285Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:18.8859355Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:18.9046586Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:19.0034016Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:19.0069814Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:19.2169016Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:19.2214757Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:19.3254154Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:19.3291012Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:19.6313822Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:19.6352361Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:19.6877707Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:19.6881391Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:19.7582185Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:19.7625797Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:19.8049789Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:19.8626074Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:19.8721787Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:19.9871320Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:19.9904738Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:20.0924969Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:20.0973417Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:20.1490569Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:20.2105295Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:20.2150248Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:20.3081186Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:20.3113195Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:20.4127630Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:20.4177312Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:20.5266835Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.5298356Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:20.6265302Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:20.6297801Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:20.7307227Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.7340691Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.8373737Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.8405913Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.8993252Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.9491394Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.9533043Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.9928789Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:21.0385289Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:21.0418653Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:21.0863246Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:21.1476788Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:21.1512629Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:21.2000184Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:21.2494547Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:21.3079769Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:21.9113365Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 46.2 MB/s eta 0:00:00 2025-05-07T20:28:21.9147356Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:21.9782501Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:22.0387960Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:22.0986501Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:22.1615561Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:22.2239390Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB) 2025-05-07T20:28:22.2837138Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 763.0/763.0 kB 8.7 MB/s eta 0:00:00 2025-05-07T20:28:22.2901625Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:22.3431334Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.3934085Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:22.4573193Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:22.5207913Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:22.5725256Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:22.6336334Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:22.6941627Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.7479926Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:22.8079460Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.9900829Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:25.3914093Z 2025-05-07T20:28:25.3941825Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:25.5768861Z ################################################################################ 2025-05-07T20:28:25.5769348Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:25.5769705Z # 2025-05-07T20:28:25.5788581Z # [2025-05-07T20:28:25.578Z] + install_triton_pip build_binary 2025-05-07T20:28:25.5789137Z ################################################################################ 2025-05-07T20:28:25.5789476Z 2025-05-07T20:28:25.5789823Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:25.5790438Z ################################################################################ 2025-05-07T20:28:25.5790975Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:25.5791413Z # 2025-05-07T20:28:25.5809483Z # [2025-05-07T20:28:25.580Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.5810000Z ################################################################################ 2025-05-07T20:28:25.5810214Z 2025-05-07T20:28:25.5827848Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:25.6746242Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:25.6746793Z ################################################################################ 2025-05-07T20:28:25.6747267Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:25.6747635Z # 2025-05-07T20:28:25.6767709Z # [2025-05-07T20:28:25.676Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6768191Z ################################################################################ 2025-05-07T20:28:25.6768821Z 2025-05-07T20:28:25.6815564Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:25.6831064Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:25.6831574Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:25.6840166Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6850219Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:25.6870403Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.3933499Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:33.3934875Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:33.3935611Z 2025-05-07T20:28:33.3935823Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.3936241Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:33.3937042Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:33.3938245Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:33.3939781Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 55.0 MB/s eta 0:00:00 2025-05-07T20:28:33.3940167Z Installing collected packages: pytorch-triton 2025-05-07T20:28:33.3940516Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:33.3940904Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:33.3941329Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:33.3941755Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:33.3942186Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:33.3942448Z 2025-05-07T20:28:35.6218593Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:35.6222551Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:37.7927469Z ################################################################################ 2025-05-07T20:28:37.7928065Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:37.7929025Z ################################################################################ 2025-05-07T20:28:37.7937563Z 2025-05-07T20:28:39.8577745Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:41.9874387Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:41.9878693Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:41.9923728Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.9924408Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.9936155Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:41.9936510Z env: 2025-05-07T20:28:41.9936737Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:41.9937033Z BUILD_ENV: build_binary 2025-05-07T20:28:41.9937279Z BUILD_TARGET: genai 2025-05-07T20:28:41.9937510Z BUILD_VARIANT: cuda 2025-05-07T20:28:41.9937740Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:41.9937993Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:41.9938299Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:41.9938802Z ##[endgroup] 2025-05-07T20:28:42.3328473Z ################################################################################ 2025-05-07T20:28:42.3329241Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:42.3329503Z # 2025-05-07T20:28:42.3345928Z # [2025-05-07T20:28:42.334Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3346580Z ################################################################################ 2025-05-07T20:28:42.3346796Z 2025-05-07T20:28:42.3347159Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3347853Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3348191Z 2025-05-07T20:28:42.3465491Z d2bc5ec7f2c503b96ed71ce870e3919d4c82a2c7 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3468162Z 2025-05-07T20:28:42.3468592Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3468943Z 2025-05-07T20:28:42.3597480Z fb057b0fc70bac7d6bace794c1630e92472ffbffb4b9efd8fa610079134b2303 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3600662Z 2025-05-07T20:28:42.3601539Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3602174Z 2025-05-07T20:28:42.3829704Z d723859d888c0acd7c881d03de8ae205 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3831807Z 2025-05-07T20:28:42.3844266Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:42.3865249Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.0613768Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.0614732Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:45.0615603Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:45.0616038Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:45.0616313Z 2025-05-07T20:28:52.0321405Z ################################################################################ 2025-05-07T20:28:52.0321797Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:52.0322171Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:52.0322603Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:28:52.0322915Z [CHECK] 2025-05-07T20:28:52.0323240Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:52.0323877Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:52.0324274Z ################################################################################ 2025-05-07T20:28:52.0324495Z 2025-05-07T20:28:52.0324635Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:56.0211401Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:28:59.9993656Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.9908067Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.9911655Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:15.9725240Z ################################################################################ 2025-05-07T20:29:15.9725801Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:15.9726260Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:15.9726736Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:15.9727218Z ################################################################################ 2025-05-07T20:29:15.9727528Z 2025-05-07T20:29:23.9549368Z ################################################################################ 2025-05-07T20:29:23.9550869Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:23.9552224Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:23.9553839Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:23.9554354Z ################################################################################ 2025-05-07T20:29:23.9554580Z 2025-05-07T20:29:23.9554737Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:27.9707806Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:31.9403885Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:36.0485215Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:40.0445570Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:40.0450078Z [INSTALL] Check for operator registrations ... 
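[NOTE] Editorial sketch: each registration check below amounts to importing fbgemm_gpu (which loads the compiled libraries) and looking the operator up on the torch.ops.fbgemm namespace; the lookup raises if the op was never registered. A one-liner equivalent, assuming the same env:

    conda run -n build_binary python -c 'import fbgemm_gpu, torch; print(torch.ops.fbgemm.nccl_init)'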
2025-05-07T20:29:43.9552527Z fbgemm.nccl_init 2025-05-07T20:29:43.9552764Z 2025-05-07T20:29:44.0232577Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:47.9423808Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:47.9424076Z 2025-05-07T20:29:48.0070966Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:51.9232277Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:51.9232553Z 2025-05-07T20:29:51.9900067Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:51.9900807Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:51.9937152Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:51.9937610Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:51.9953742Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:51.9954096Z env: 2025-05-07T20:29:51.9954321Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:51.9954615Z BUILD_ENV: build_binary 2025-05-07T20:29:51.9954861Z BUILD_TARGET: genai 2025-05-07T20:29:51.9955093Z BUILD_VARIANT: cuda 2025-05-07T20:29:51.9955330Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:51.9955583Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:51.9955886Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:51.9956221Z ##[endgroup] 2025-05-07T20:29:52.3330801Z ################################################################################ 2025-05-07T20:29:52.3331323Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:52.3331660Z # 2025-05-07T20:29:52.3348497Z # [2025-05-07T20:29:52.334Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:52.3349066Z ################################################################################ 2025-05-07T20:29:52.3349359Z 2025-05-07T20:30:00.3346791Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:00.3347354Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:00.3347753Z [TEST] Determined the test directories: 2025-05-07T20:30:00.3348070Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:00.3348364Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:00.3348670Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:00.3348857Z 2025-05-07T20:30:00.3357403Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:00.3364524Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:00.3364962Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:00.3365250Z 2025-05-07T20:30:00.7650341Z 2025-05-07T20:30:00.7650788Z [TEST] Installing PyTest ... 
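[NOTE] Editorial sketch: pytest and expecttest are installed from conda-forge only (--override-channels), leaving the pip-installed torch stack in the env untouched; hypothesis already arrived earlier via requirements.txt. A quick sanity check after the install:

    conda run -n build_binary python -m pytest --version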
2025-05-07T20:30:00.7674960Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:01.8748471Z Channels: 2025-05-07T20:30:01.8748720Z - conda-forge 2025-05-07T20:30:01.8748955Z Platform: linux-64 2025-05-07T20:30:05.3015720Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:06.4607586Z Solving environment: \ | / done 2025-05-07T20:30:06.6889336Z 2025-05-07T20:30:06.6889755Z ## Package Plan ## 2025-05-07T20:30:06.6889939Z 2025-05-07T20:30:06.6890216Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:06.6890626Z 2025-05-07T20:30:06.6890726Z added / updated specs: 2025-05-07T20:30:06.6890989Z - expecttest 2025-05-07T20:30:06.6891202Z - pytest 2025-05-07T20:30:06.6891327Z 2025-05-07T20:30:06.6891332Z 2025-05-07T20:30:06.6891453Z The following packages will be downloaded: 2025-05-07T20:30:06.6891714Z 2025-05-07T20:30:06.6891836Z package | build 2025-05-07T20:30:06.6892151Z ---------------------------|----------------- 2025-05-07T20:30:06.6892527Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:06.6892990Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:06.6893452Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:06.6893885Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:06.6894317Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:06.6894740Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:06.6895141Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:06.6896052Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:06.6896463Z ------------------------------------------------------------ 2025-05-07T20:30:06.6896808Z Total: 428 KB 2025-05-07T20:30:06.6897018Z 2025-05-07T20:30:06.6897148Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:06.6897373Z 2025-05-07T20:30:06.6897576Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:06.6898093Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:06.6898624Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:06.6899091Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:06.6899558Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:06.6900001Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:06.6900432Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:06.6900852Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:06.6901110Z 2025-05-07T20:30:06.6901114Z 2025-05-07T20:30:06.6901118Z 2025-05-07T20:30:06.6901263Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:06.6901633Z [progress bars for pytest-8.3.5, packaging-25.0, colorama-0.4.6, pluggy-1.5.0, exceptiongroup-1.2.2, tomli-2.2.1, expecttest-0.3.0, and iniconfig-2.0.0 condensed; every package reached 100%] 2025-05-07T20:30:07.1851020Z done 2025-05-07T20:30:07.2844709Z Preparing transaction: done 2025-05-07T20:30:07.3849608Z Verifying transaction: done 2025-05-07T20:30:09.2875952Z Executing transaction: done 2025-05-07T20:30:09.4232043Z [TEST] Checking imports ... 2025-05-07T20:30:13.3988476Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:13.4001327Z [TEST] Setting feature flags ...
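[NOTE] Editorial sketch: feature flags are persisted as conda env vars, so they apply to every later conda run against this env. To confirm what ended up set:

    conda env config vars list -n build_binary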
2025-05-07T20:30:13.4001773Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:13.4002106Z 2025-05-07T20:30:13.8283032Z 2025-05-07T20:30:13.8283910Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:13.8285297Z ################################################################################ 2025-05-07T20:30:13.8285753Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:13.8286056Z # 2025-05-07T20:30:13.8305685Z # [2025-05-07T20:30:13.830Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:13.8306464Z ################################################################################ 2025-05-07T20:30:13.8306678Z 2025-05-07T20:30:13.8313480Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:13.8342217Z ./attention/gqa_test.py 2025-05-07T20:30:13.8342541Z ./coalesce/coalesce_test.py 2025-05-07T20:30:13.8342931Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:13.8343218Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:13.8343506Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:13.8343769Z ./moe/activation_test.py 2025-05-07T20:30:13.8344019Z ./moe/gather_scatter_test.py 2025-05-07T20:30:13.8344271Z ./moe/layers_test.py 2025-05-07T20:30:13.8344510Z ./moe/shuffling_test.py 2025-05-07T20:30:13.8344761Z ./quantize/quantize_test.py 2025-05-07T20:30:13.8344921Z 2025-05-07T20:30:13.8345036Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:13.8345253Z 2025-05-07T20:30:13.8362698Z ################################################################################ 2025-05-07T20:30:13.8378141Z # [2025-05-07T20:30:13.837Z] Run Python Test Suite: 2025-05-07T20:30:13.8378519Z # ./attention/gqa_test.py 2025-05-07T20:30:13.8378898Z ################################################################################ 2025-05-07T20:30:13.8402501Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:13.8403107Z 2025-05-07T20:30:16.3632409Z ============================= test session starts ============================== 2025-05-07T20:30:16.3633486Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:16.3634351Z cachedir: .pytest_cache 2025-05-07T20:30:16.3635339Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:16.3636940Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:16.3637644Z plugins: hypothesis-6.131.14 2025-05-07T20:30:18.0868819Z collecting ... 
collected 2 items 2025-05-07T20:30:18.0869047Z 2025-05-07T20:30:52.9141952Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:30:52.9142573Z self=, 2025-05-07T20:30:52.9142973Z int4_kv=False, 2025-05-07T20:30:52.9143242Z num_groups=1, 2025-05-07T20:30:52.9143505Z B=1, 2025-05-07T20:30:52.9143726Z MAX_T=4, 2025-05-07T20:30:52.9143964Z N_H_L=1, 2025-05-07T20:30:52.9144211Z ) 2025-05-07T20:30:52.9144446Z Trying example: test_gqa( 2025-05-07T20:30:52.9144810Z self=, 2025-05-07T20:30:52.9145202Z int4_kv=True, 2025-05-07T20:30:52.9145453Z num_groups=1, 2025-05-07T20:30:52.9145705Z B=1, 2025-05-07T20:30:52.9145941Z MAX_T=4, 2025-05-07T20:30:52.9146187Z N_H_L=1, 2025-05-07T20:30:52.9146424Z ) 2025-05-07T20:30:52.9146703Z Trying example: test_gqa( 2025-05-07T20:30:52.9147069Z self=, 2025-05-07T20:30:52.9147460Z int4_kv=True, 2025-05-07T20:30:52.9147714Z num_groups=4, 2025-05-07T20:30:52.9147952Z B=23, 2025-05-07T20:30:52.9148182Z MAX_T=33, 2025-05-07T20:30:52.9148420Z N_H_L=68, 2025-05-07T20:30:52.9148653Z ) 2025-05-07T20:30:52.9148895Z Trying example: test_gqa( 2025-05-07T20:30:52.9149256Z self=, 2025-05-07T20:30:52.9149631Z int4_kv=True, 2025-05-07T20:30:52.9149887Z num_groups=4, 2025-05-07T20:30:52.9150133Z B=77, 2025-05-07T20:30:52.9150350Z MAX_T=4, 2025-05-07T20:30:52.9150583Z N_H_L=1, 2025-05-07T20:30:52.9150811Z ) 2025-05-07T20:30:52.9151036Z Trying example: test_gqa( 2025-05-07T20:30:52.9151419Z self=, 2025-05-07T20:30:52.9151818Z int4_kv=True, 2025-05-07T20:30:52.9152072Z num_groups=4, 2025-05-07T20:30:52.9152313Z B=77, 2025-05-07T20:30:52.9153078Z MAX_T=52, 2025-05-07T20:30:52.9153317Z N_H_L=67, 2025-05-07T20:30:52.9153546Z ) 2025-05-07T20:30:52.9153783Z Trying example: test_gqa( 2025-05-07T20:30:52.9154134Z self=, 2025-05-07T20:30:52.9154510Z int4_kv=False, 2025-05-07T20:30:52.9154792Z num_groups=4, 2025-05-07T20:30:52.9155042Z B=57, 2025-05-07T20:30:52.9155263Z MAX_T=45, 2025-05-07T20:30:52.9155503Z N_H_L=120, 2025-05-07T20:30:52.9155739Z ) 2025-05-07T20:30:52.9155962Z Trying example: test_gqa( 2025-05-07T20:30:52.9156313Z self=, 2025-05-07T20:30:52.9156696Z int4_kv=True, 2025-05-07T20:30:52.9156944Z num_groups=4, 2025-05-07T20:30:52.9157193Z B=52, 2025-05-07T20:30:52.9157419Z MAX_T=42, 2025-05-07T20:30:52.9157644Z N_H_L=53, 2025-05-07T20:30:52.9157874Z ) 2025-05-07T20:30:52.9158105Z Trying example: test_gqa( 2025-05-07T20:30:52.9158445Z self=, 2025-05-07T20:30:52.9158841Z int4_kv=True, 2025-05-07T20:30:52.9159094Z num_groups=1, 2025-05-07T20:30:52.9159334Z B=77, 2025-05-07T20:30:52.9159560Z MAX_T=95, 2025-05-07T20:30:52.9159796Z N_H_L=53, 2025-05-07T20:30:52.9160027Z ) 2025-05-07T20:30:52.9160253Z Trying example: test_gqa( 2025-05-07T20:30:52.9160603Z self=, 2025-05-07T20:30:52.9160980Z int4_kv=True, 2025-05-07T20:30:52.9161223Z num_groups=4, 2025-05-07T20:30:52.9161476Z B=113, 2025-05-07T20:30:52.9161704Z MAX_T=48, 2025-05-07T20:30:52.9161958Z N_H_L=96, 2025-05-07T20:30:52.9162215Z ) 2025-05-07T20:30:52.9162452Z Trying example: test_gqa( 2025-05-07T20:30:52.9162796Z self=, 2025-05-07T20:30:52.9163179Z int4_kv=False, 2025-05-07T20:30:52.9163645Z num_groups=1, 2025-05-07T20:30:52.9163890Z B=51, 2025-05-07T20:30:52.9164116Z MAX_T=61, 2025-05-07T20:30:52.9164352Z N_H_L=69, 2025-05-07T20:30:52.9164803Z ) 2025-05-07T20:30:52.9165050Z Trying example: test_gqa( 2025-05-07T20:30:52.9165401Z self=, 2025-05-07T20:30:52.9165776Z int4_kv=False, 2025-05-07T20:30:52.9166031Z num_groups=4, 2025-05-07T20:30:52.9166284Z B=17, 2025-05-07T20:30:52.9166510Z MAX_T=113, 
2025-05-07T20:30:52.9166751Z N_H_L=65, 2025-05-07T20:30:52.9166984Z ) 2025-05-07T20:30:52.9167209Z Trying example: test_gqa( 2025-05-07T20:30:52.9167561Z self=, 2025-05-07T20:30:52.9167949Z int4_kv=False, 2025-05-07T20:30:52.9168209Z num_groups=4, 2025-05-07T20:30:52.9168475Z B=17, 2025-05-07T20:30:52.9168722Z MAX_T=65, 2025-05-07T20:30:52.9168979Z N_H_L=65, 2025-05-07T20:30:52.9169211Z ) 2025-05-07T20:30:52.9169463Z Trying example: test_gqa( 2025-05-07T20:30:52.9169873Z self=, 2025-05-07T20:30:52.9170278Z int4_kv=False, 2025-05-07T20:30:52.9170538Z num_groups=4, 2025-05-07T20:30:52.9170819Z B=65, 2025-05-07T20:30:52.9171067Z MAX_T=65, 2025-05-07T20:30:52.9171296Z N_H_L=65, 2025-05-07T20:30:52.9171530Z ) 2025-05-07T20:30:52.9171801Z Trying example: test_gqa( 2025-05-07T20:30:52.9172143Z self=, 2025-05-07T20:30:52.9172524Z int4_kv=False, 2025-05-07T20:30:52.9172776Z num_groups=1, 2025-05-07T20:30:52.9173028Z B=6, 2025-05-07T20:30:52.9173250Z MAX_T=108, 2025-05-07T20:30:52.9173490Z N_H_L=14, 2025-05-07T20:30:52.9173718Z ) 2025-05-07T20:30:52.9173942Z Trying example: test_gqa( 2025-05-07T20:30:52.9174290Z self=, 2025-05-07T20:30:52.9174672Z int4_kv=False, 2025-05-07T20:30:52.9174921Z num_groups=1, 2025-05-07T20:30:52.9175168Z B=6, 2025-05-07T20:30:52.9175395Z MAX_T=14, 2025-05-07T20:30:52.9175621Z N_H_L=14, 2025-05-07T20:30:52.9175850Z ) 2025-05-07T20:30:52.9176082Z Trying example: test_gqa( 2025-05-07T20:30:52.9176479Z self=, 2025-05-07T20:30:52.9176959Z int4_kv=False, 2025-05-07T20:30:52.9177212Z num_groups=1, 2025-05-07T20:30:52.9177451Z B=6, 2025-05-07T20:30:52.9177676Z MAX_T=6, 2025-05-07T20:30:52.9177907Z N_H_L=14, 2025-05-07T20:30:52.9178129Z ) 2025-05-07T20:30:52.9178358Z Trying example: test_gqa( 2025-05-07T20:30:52.9178705Z self=, 2025-05-07T20:30:52.9179081Z int4_kv=False, 2025-05-07T20:30:52.9179332Z num_groups=1, 2025-05-07T20:30:52.9179576Z B=6, 2025-05-07T20:30:52.9179794Z MAX_T=6, 2025-05-07T20:30:52.9180028Z N_H_L=6, 2025-05-07T20:30:52.9180255Z ) 2025-05-07T20:30:52.9180481Z Trying example: test_gqa( 2025-05-07T20:30:52.9180833Z self=, 2025-05-07T20:30:52.9181220Z int4_kv=False, 2025-05-07T20:30:52.9181522Z num_groups=1, 2025-05-07T20:30:52.9181763Z B=70, 2025-05-07T20:30:52.9181987Z MAX_T=94, 2025-05-07T20:30:52.9182219Z N_H_L=78, 2025-05-07T20:30:52.9182455Z ) 2025-05-07T20:30:52.9182689Z Trying example: test_gqa( 2025-05-07T20:30:52.9183040Z self=, 2025-05-07T20:30:52.9183413Z int4_kv=False, 2025-05-07T20:30:52.9183666Z num_groups=1, 2025-05-07T20:30:52.9183918Z B=78, 2025-05-07T20:30:52.9184134Z MAX_T=94, 2025-05-07T20:30:52.9184369Z N_H_L=78, 2025-05-07T20:30:52.9184598Z ) 2025-05-07T20:30:52.9184823Z Trying example: test_gqa( 2025-05-07T20:30:52.9185172Z self=, 2025-05-07T20:30:52.9185555Z int4_kv=False, 2025-05-07T20:30:52.9185800Z num_groups=1, 2025-05-07T20:30:52.9186045Z B=94, 2025-05-07T20:30:52.9186269Z MAX_T=94, 2025-05-07T20:30:52.9186492Z N_H_L=78, 2025-05-07T20:30:52.9186720Z ) 2025-05-07T20:30:52.9186950Z Trying example: test_gqa( 2025-05-07T20:30:52.9187288Z self=, 2025-05-07T20:30:52.9187668Z int4_kv=False, 2025-05-07T20:30:52.9188030Z num_groups=1, 2025-05-07T20:30:52.9188279Z B=94, 2025-05-07T20:30:52.9188509Z MAX_T=94, 2025-05-07T20:30:52.9188746Z N_H_L=94, 2025-05-07T20:30:52.9188968Z ) 2025-05-07T20:30:52.9189199Z Trying example: test_gqa( 2025-05-07T20:30:52.9189547Z self=, 2025-05-07T20:30:52.9189924Z int4_kv=False, 2025-05-07T20:30:52.9190169Z num_groups=4, 2025-05-07T20:30:52.9190413Z B=41, 2025-05-07T20:30:52.9190638Z MAX_T=105, 
2025-05-07T20:30:52.9190891Z N_H_L=126, 2025-05-07T20:30:52.9191097Z ) 2025-05-07T20:30:52.9191284Z Trying example: test_gqa( 2025-05-07T20:30:52.9191567Z self=, 2025-05-07T20:30:52.9191879Z int4_kv=False, 2025-05-07T20:30:52.9192087Z num_groups=4, 2025-05-07T20:30:52.9192292Z B=105, 2025-05-07T20:30:52.9192481Z MAX_T=105, 2025-05-07T20:30:52.9192683Z N_H_L=126, 2025-05-07T20:30:52.9192873Z ) 2025-05-07T20:30:52.9193063Z Trying example: test_gqa( 2025-05-07T20:30:52.9193358Z self=, 2025-05-07T20:30:52.9193667Z int4_kv=False, 2025-05-07T20:30:52.9193882Z num_groups=4, 2025-05-07T20:30:52.9194088Z B=105, 2025-05-07T20:30:52.9194269Z MAX_T=105, 2025-05-07T20:30:52.9194468Z N_H_L=105, 2025-05-07T20:30:52.9194661Z ) 2025-05-07T20:30:52.9194845Z Trying example: test_gqa( 2025-05-07T20:30:52.9195134Z self=, 2025-05-07T20:30:52.9195443Z int4_kv=True, 2025-05-07T20:30:52.9195653Z num_groups=1, 2025-05-07T20:30:52.9195852Z B=95, 2025-05-07T20:30:52.9196040Z MAX_T=114, 2025-05-07T20:30:52.9196238Z N_H_L=43, 2025-05-07T20:30:52.9196422Z ) 2025-05-07T20:30:52.9196613Z Trying example: test_gqa( 2025-05-07T20:30:52.9196904Z self=, 2025-05-07T20:30:52.9197204Z int4_kv=True, 2025-05-07T20:30:52.9197410Z num_groups=1, 2025-05-07T20:30:52.9197615Z B=43, 2025-05-07T20:30:52.9197795Z MAX_T=114, 2025-05-07T20:30:52.9198095Z N_H_L=43, 2025-05-07T20:30:52.9198288Z ) 2025-05-07T20:30:52.9198473Z Trying example: test_gqa( 2025-05-07T20:30:52.9198769Z self=, 2025-05-07T20:30:52.9199086Z int4_kv=True, 2025-05-07T20:30:52.9199284Z num_groups=1, 2025-05-07T20:30:52.9199676Z B=43, 2025-05-07T20:30:52.9199865Z MAX_T=43, 2025-05-07T20:30:52.9200053Z N_H_L=43, 2025-05-07T20:30:52.9200242Z ) 2025-05-07T20:30:52.9200437Z Trying example: test_gqa( 2025-05-07T20:30:52.9200719Z self=, 2025-05-07T20:30:52.9201033Z int4_kv=False, 2025-05-07T20:30:52.9201240Z num_groups=1, 2025-05-07T20:30:52.9201444Z B=21, 2025-05-07T20:30:52.9201620Z MAX_T=38, 2025-05-07T20:30:52.9201816Z N_H_L=42, 2025-05-07T20:30:52.9202007Z ) 2025-05-07T20:30:52.9202189Z Trying example: test_gqa( 2025-05-07T20:30:52.9202477Z self=, 2025-05-07T20:30:52.9202792Z int4_kv=False, 2025-05-07T20:30:52.9203002Z num_groups=1, 2025-05-07T20:30:52.9203212Z B=38, 2025-05-07T20:30:52.9203521Z MAX_T=38, 2025-05-07T20:30:52.9203723Z N_H_L=42, 2025-05-07T20:30:52.9203916Z ) 2025-05-07T20:30:52.9204111Z Trying example: test_gqa( 2025-05-07T20:30:52.9204400Z self=, 2025-05-07T20:30:52.9204712Z int4_kv=False, 2025-05-07T20:30:52.9204929Z num_groups=1, 2025-05-07T20:30:52.9205129Z B=38, 2025-05-07T20:30:52.9205319Z MAX_T=42, 2025-05-07T20:30:52.9205508Z N_H_L=42, 2025-05-07T20:30:52.9205694Z ) 2025-05-07T20:30:52.9205887Z Trying example: test_gqa( 2025-05-07T20:30:52.9206176Z self=, 2025-05-07T20:30:52.9206483Z int4_kv=False, 2025-05-07T20:30:52.9206696Z num_groups=1, 2025-05-07T20:30:52.9206903Z B=42, 2025-05-07T20:30:52.9207083Z MAX_T=42, 2025-05-07T20:30:52.9207281Z N_H_L=42, 2025-05-07T20:30:52.9207472Z ) 2025-05-07T20:30:52.9207757Z Trying example: test_gqa( 2025-05-07T20:30:52.9208060Z self=, 2025-05-07T20:30:52.9208374Z int4_kv=True, 2025-05-07T20:30:52.9208574Z num_groups=1, 2025-05-07T20:30:52.9208778Z B=74, 2025-05-07T20:30:52.9208961Z MAX_T=20, 2025-05-07T20:30:52.9209156Z N_H_L=15, 2025-05-07T20:30:52.9209337Z ) 2025-05-07T20:30:52.9209526Z Trying example: test_gqa( 2025-05-07T20:30:52.9209818Z self=, 2025-05-07T20:30:52.9210122Z int4_kv=True, 2025-05-07T20:30:52.9210332Z num_groups=1, 2025-05-07T20:30:52.9210536Z B=20, 2025-05-07T20:30:52.9210715Z MAX_T=20, 
2025-05-07T20:30:52.9210913Z N_H_L=15, 2025-05-07T20:30:52.9211103Z ) 2025-05-07T20:30:52.9211289Z Trying example: test_gqa( 2025-05-07T20:30:52.9211579Z self=, 2025-05-07T20:30:52.9211890Z int4_kv=True, 2025-05-07T20:30:52.9212089Z num_groups=1, 2025-05-07T20:30:52.9212294Z B=20, 2025-05-07T20:30:52.9212485Z MAX_T=15, 2025-05-07T20:30:52.9212677Z N_H_L=15, 2025-05-07T20:30:52.9212871Z ) 2025-05-07T20:30:52.9213060Z Trying example: test_gqa( 2025-05-07T20:30:52.9213341Z self=, 2025-05-07T20:30:52.9213663Z int4_kv=True, 2025-05-07T20:30:52.9213872Z num_groups=1, 2025-05-07T20:30:52.9214068Z B=15, 2025-05-07T20:30:52.9214258Z MAX_T=20, 2025-05-07T20:30:52.9214452Z N_H_L=15, 2025-05-07T20:30:52.9214638Z ) 2025-05-07T20:30:52.9214834Z Trying example: test_gqa( 2025-05-07T20:30:52.9215127Z self=, 2025-05-07T20:30:52.9215429Z int4_kv=True, 2025-05-07T20:30:52.9215634Z num_groups=1, 2025-05-07T20:30:52.9215839Z B=15, 2025-05-07T20:30:52.9216020Z MAX_T=15, 2025-05-07T20:30:52.9216219Z N_H_L=15, 2025-05-07T20:30:52.9216413Z ) 2025-05-07T20:30:52.9216609Z Trying example: test_gqa( 2025-05-07T20:30:52.9216892Z self=, 2025-05-07T20:30:52.9217330Z int4_kv=False, 2025-05-07T20:30:52.9217546Z num_groups=4, 2025-05-07T20:30:52.9217744Z B=117, 2025-05-07T20:30:52.9217935Z MAX_T=104, 2025-05-07T20:30:52.9218134Z N_H_L=69, 2025-05-07T20:30:52.9218316Z ) 2025-05-07T20:30:52.9218509Z Trying example: test_gqa( 2025-05-07T20:30:52.9218796Z self=, 2025-05-07T20:30:52.9219099Z int4_kv=False, 2025-05-07T20:30:52.9219311Z num_groups=4, 2025-05-07T20:30:52.9219517Z B=117, 2025-05-07T20:30:52.9219698Z MAX_T=117, 2025-05-07T20:30:52.9219895Z N_H_L=69, 2025-05-07T20:30:52.9220085Z ) 2025-05-07T20:30:52.9220269Z Trying example: test_gqa( 2025-05-07T20:30:52.9220582Z self=, 2025-05-07T20:30:52.9220895Z int4_kv=False, 2025-05-07T20:30:52.9221099Z num_groups=4, 2025-05-07T20:30:52.9221304Z B=69, 2025-05-07T20:30:52.9221517Z MAX_T=117, 2025-05-07T20:30:52.9221737Z N_H_L=69, 2025-05-07T20:30:52.9221927Z ) 2025-05-07T20:30:52.9222129Z Trying example: test_gqa( 2025-05-07T20:30:52.9222417Z self=, 2025-05-07T20:30:52.9222798Z int4_kv=False, 2025-05-07T20:30:52.9223098Z num_groups=4, 2025-05-07T20:30:52.9223352Z B=117, 2025-05-07T20:30:52.9232768Z MAX_T=69, 2025-05-07T20:30:52.9233072Z N_H_L=69, 2025-05-07T20:30:52.9233280Z ) 2025-05-07T20:30:52.9233476Z PASSED 2025-05-07T20:30:52.9457225Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
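[Editor's note] The "Trying example: test_gqa(...)" lines above are Hypothesis's verbose example log. The "hypothesis profile 'ci'" line in each session header (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)) makes generation deterministic, so the same example sequence replays on every run. Below is a minimal sketch of how such a profile can be registered and combined with per-test @settings (the per-test verbosity=Verbosity.verbose is what produces the "Trying example:" output, as visible in the test source quoted later in this log); the conftest wiring shown here is an assumption, not FBGEMM's actual setup:

import hypothesis.strategies as st
from hypothesis import HealthCheck, Verbosity, given, settings

# Profile values mirror the "hypothesis profile 'ci'" line in the
# session headers of this log; where this registration lives in the
# FBGEMM tree is an assumption.
settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")

# Per-test settings layer on top of the active profile.
@given(B=st.integers(min_value=1, max_value=128))
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def test_example(B: int) -> None:
    assert B >= 1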
2025-05-07T20:30:52.9457558Z 2025-05-07T20:30:52.9458202Z =========================== short test summary info ============================ 2025-05-07T20:30:52.9459207Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:52.9460145Z ======================== 1 passed, 1 skipped in 37.06s ========================= 2025-05-07T20:30:53.6294817Z 2025-05-07T20:30:53.6295373Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:53.6317036Z [TEST] Python test time for ./attention/gqa_test.py: 40 seconds 2025-05-07T20:30:53.6317326Z 2025-05-07T20:30:53.6317460Z 2025-05-07T20:30:53.6317466Z 2025-05-07T20:30:53.6317493Z 2025-05-07T20:30:53.6338027Z ################################################################################ 2025-05-07T20:30:53.6353845Z # [2025-05-07T20:30:53.635Z] Run Python Test Suite: 2025-05-07T20:30:53.6354224Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:53.6354600Z ################################################################################ 2025-05-07T20:30:53.6380319Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:53.6380928Z 2025-05-07T20:30:55.7972950Z ============================= test session starts ============================== 2025-05-07T20:30:55.7973657Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:55.7974210Z cachedir: .pytest_cache 2025-05-07T20:30:55.7974802Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:55.7975523Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:55.7975940Z plugins: hypothesis-6.131.14 2025-05-07T20:30:57.4715670Z collecting ... 
collected 1 item 2025-05-07T20:30:57.4716095Z 2025-05-07T20:30:58.2035159Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:30:58.2035506Z 2025-05-07T20:30:58.2035653Z ============================== 1 passed in 2.52s =============================== 2025-05-07T20:30:58.8898670Z 2025-05-07T20:30:58.8899261Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:30:58.8919804Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:30:58.8920138Z 2025-05-07T20:30:58.8920501Z 2025-05-07T20:30:58.8920506Z 2025-05-07T20:30:58.8920520Z 2025-05-07T20:30:58.8941206Z ################################################################################ 2025-05-07T20:30:58.8956474Z # [2025-05-07T20:30:58.895Z] Run Python Test Suite: 2025-05-07T20:30:58.8956802Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:30:58.8957098Z ################################################################################ 2025-05-07T20:30:58.8981191Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:30:58.8981818Z 2025-05-07T20:31:01.0619113Z ============================= test session starts ============================== 2025-05-07T20:31:01.0619738Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:01.0620269Z cachedir: .pytest_cache 2025-05-07T20:31:01.0620866Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:01.0621611Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:01.0622028Z plugins: hypothesis-6.131.14 2025-05-07T20:31:02.7706633Z collecting ... 
collected 5 items 2025-05-07T20:31:02.7706961Z 2025-05-07T20:31:02.7716276Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:02.7736706Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:02.7745486Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:02.7752217Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:02.7766620Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:02.7767093Z 2025-05-07T20:31:02.7767610Z =========================== short test summary info ============================ 2025-05-07T20:31:02.7768345Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7769276Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7770203Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7771128Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7772054Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7772701Z ============================== 5 skipped in 1.83s ============================== 2025-05-07T20:31:03.3837838Z 2025-05-07T20:31:03.3840585Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:03.3862476Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:03.3862890Z 2025-05-07T20:31:03.3862896Z 2025-05-07T20:31:03.3862902Z 2025-05-07T20:31:03.3862907Z 2025-05-07T20:31:03.3883548Z ################################################################################ 2025-05-07T20:31:03.3902190Z # [2025-05-07T20:31:03.389Z] Run Python Test Suite: 2025-05-07T20:31:03.3902679Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:03.3903122Z ################################################################################ 2025-05-07T20:31:03.3926795Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:03.3927554Z 2025-05-07T20:31:05.5462058Z ============================= test session starts ============================== 2025-05-07T20:31:05.5463198Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:05.5463710Z cachedir: .pytest_cache 2025-05-07T20:31:05.5464284Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:05.5465008Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:05.5465418Z plugins: hypothesis-6.131.14 2025-05-07T20:31:07.3892566Z collecting ... 
collected 2 items 2025-05-07T20:31:07.3893208Z 2025-05-07T20:31:07.3901556Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:07.3915511Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:07.3916120Z 2025-05-07T20:31:07.3916386Z =========================== short test summary info ============================ 2025-05-07T20:31:07.3917090Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:07.3917922Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:07.3918523Z ============================== 2 skipped in 1.96s ============================== 2025-05-07T20:31:08.0128909Z 2025-05-07T20:31:08.0129704Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.0148812Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:08.0149272Z 2025-05-07T20:31:08.0149277Z 2025-05-07T20:31:08.0149281Z 2025-05-07T20:31:08.0149285Z 2025-05-07T20:31:08.0171148Z ################################################################################ 2025-05-07T20:31:08.0186421Z # [2025-05-07T20:31:08.018Z] Run Python Test Suite: 2025-05-07T20:31:08.0187113Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:08.0187416Z ################################################################################ 2025-05-07T20:31:08.0211325Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:08.0211940Z 2025-05-07T20:31:10.1767531Z ============================= test session starts ============================== 2025-05-07T20:31:10.1768359Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:10.1768887Z cachedir: .pytest_cache 2025-05-07T20:31:10.1769464Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:10.1770199Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:10.1770604Z plugins: hypothesis-6.131.14 2025-05-07T20:31:11.9613015Z collecting ... collected 4 items 2025-05-07T20:31:11.9613349Z 2025-05-07T20:31:14.8073416Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
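[Editor's note] The skips above and immediately below are hardware gating rather than failures: multi_gpu_car_test needs at least two CUDA devices, gather_scatter_test runs only on Hopper, and the kv_cache tests want an H100/MI300 or xformers. A minimal sketch of this guard pattern using torch's device introspection follows; the exact decorators in the FBGEMM sources are not shown in this log, so treat it as illustrative:

import unittest

import torch

def gpu_count() -> int:
    # Number of visible CUDA devices, zero when CUDA is unavailable.
    return torch.cuda.device_count() if torch.cuda.is_available() else 0

def is_hopper() -> bool:
    # H100-class (Hopper) devices report compute capability (9, 0).
    return gpu_count() > 0 and torch.cuda.get_device_capability() == (9, 0)

class ExampleTests(unittest.TestCase):
    @unittest.skipIf(
        gpu_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )
    def test_needs_two_gpus(self) -> None:
        ...

    @unittest.skipIf(
        not is_hopper(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )
    def test_needs_hopper(self) -> None:
        ...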
2025-05-07T20:31:14.8199294Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:14.8345859Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:14.8471333Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:14.8471689Z 2025-05-07T20:31:14.8471847Z =========================== short test summary info ============================ 2025-05-07T20:31:14.8472538Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:14.8473463Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when xformers is not available 2025-05-07T20:31:14.8474099Z ============================== 4 skipped in 4.79s ============================== 2025-05-07T20:31:16.6951106Z 2025-05-07T20:31:16.6951856Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:16.6972132Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:16.6972475Z 2025-05-07T20:31:16.6972481Z 2025-05-07T20:31:16.6972486Z 2025-05-07T20:31:16.6972491Z 2025-05-07T20:31:16.6994392Z ################################################################################ 2025-05-07T20:31:16.7010190Z # [2025-05-07T20:31:16.700Z] Run Python Test Suite: 2025-05-07T20:31:16.7010638Z # ./moe/activation_test.py 2025-05-07T20:31:16.7010972Z ################################################################################ 2025-05-07T20:31:16.7034911Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:16.7035515Z 2025-05-07T20:31:18.8645616Z ============================= test session starts ============================== 2025-05-07T20:31:18.8646291Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:18.8646827Z cachedir: .pytest_cache 2025-05-07T20:31:18.8647414Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:18.8648142Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:18.8648561Z plugins: hypothesis-6.131.14 2025-05-07T20:31:20.5162861Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:20.6685356Z collecting ... 
collected 2 items 2025-05-07T20:31:20.6685577Z 2025-05-07T20:31:26.1336560Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:26.1337206Z self=, 2025-05-07T20:31:26.1338025Z T=1, 2025-05-07T20:31:26.1338234Z D=5120, 2025-05-07T20:31:26.1338670Z contiguous=True, 2025-05-07T20:31:26.1338905Z compiled=True, 2025-05-07T20:31:26.1339115Z ) 2025-05-07T20:31:26.1339319Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1339704Z self=, 2025-05-07T20:31:26.1340082Z T=4096, 2025-05-07T20:31:26.1340275Z D=5120, 2025-05-07T20:31:26.1340479Z contiguous=True, 2025-05-07T20:31:26.1340699Z compiled=True, 2025-05-07T20:31:26.1340911Z ) 2025-05-07T20:31:26.1341116Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1341491Z self=, 2025-05-07T20:31:26.1341872Z T=4096, 2025-05-07T20:31:26.1342068Z D=7168, 2025-05-07T20:31:26.1342261Z contiguous=False, 2025-05-07T20:31:26.1342493Z compiled=False, 2025-05-07T20:31:26.1342703Z ) 2025-05-07T20:31:26.1342894Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1343280Z self=, 2025-05-07T20:31:26.1343664Z T=4096, 2025-05-07T20:31:26.1343856Z D=5120, 2025-05-07T20:31:26.1344047Z contiguous=False, 2025-05-07T20:31:26.1344276Z compiled=True, 2025-05-07T20:31:26.1344487Z ) 2025-05-07T20:31:26.1344678Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1345056Z self=, 2025-05-07T20:31:26.1345440Z T=1, 2025-05-07T20:31:26.1345623Z D=7168, 2025-05-07T20:31:26.1345821Z contiguous=True, 2025-05-07T20:31:26.1346053Z compiled=True, 2025-05-07T20:31:26.1347716Z ) 2025-05-07T20:31:26.1347918Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1348293Z self=, 2025-05-07T20:31:26.1348669Z T=1, 2025-05-07T20:31:26.1348857Z D=7168, 2025-05-07T20:31:26.1349058Z contiguous=False, 2025-05-07T20:31:26.1349281Z compiled=True, 2025-05-07T20:31:26.1349492Z ) 2025-05-07T20:31:26.1349871Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1350244Z self=, 2025-05-07T20:31:26.1350628Z T=4096, 2025-05-07T20:31:26.1350822Z D=5120, 2025-05-07T20:31:26.1351012Z contiguous=False, 2025-05-07T20:31:26.1351245Z compiled=False, 2025-05-07T20:31:26.1351456Z ) 2025-05-07T20:31:26.1351646Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1352027Z self=, 2025-05-07T20:31:26.1352450Z T=1, 2025-05-07T20:31:26.1352641Z D=7168, 2025-05-07T20:31:26.1352831Z contiguous=True, 2025-05-07T20:31:26.1353056Z compiled=False, 2025-05-07T20:31:26.1353262Z ) 2025-05-07T20:31:26.1353454Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1353831Z self=, 2025-05-07T20:31:26.1354210Z T=2048, 2025-05-07T20:31:26.1354394Z D=5120, 2025-05-07T20:31:26.1354597Z contiguous=True, 2025-05-07T20:31:26.1354829Z compiled=True, 2025-05-07T20:31:26.1355029Z ) 2025-05-07T20:31:26.1355227Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1355601Z self=, 2025-05-07T20:31:26.1355976Z T=2048, 2025-05-07T20:31:26.1356167Z D=7168, 2025-05-07T20:31:26.1356363Z contiguous=True, 2025-05-07T20:31:26.1356580Z compiled=True, 2025-05-07T20:31:26.1356789Z ) 2025-05-07T20:31:26.1356991Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1357364Z self=, 2025-05-07T20:31:26.1357744Z T=2048, 2025-05-07T20:31:26.1357926Z D=7168, 2025-05-07T20:31:26.1358123Z contiguous=True, 2025-05-07T20:31:26.1358345Z compiled=False, 2025-05-07T20:31:26.1358553Z ) 2025-05-07T20:31:26.1358750Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1359122Z self=, 2025-05-07T20:31:26.1359643Z T=128, 2025-05-07T20:31:26.1359834Z D=5120, 2025-05-07T20:31:26.1360030Z contiguous=False, 2025-05-07T20:31:26.1360265Z 
compiled=True, 2025-05-07T20:31:26.1360475Z ) 2025-05-07T20:31:26.1360673Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1361054Z self=, 2025-05-07T20:31:26.1361439Z T=128, 2025-05-07T20:31:26.1361626Z D=5120, 2025-05-07T20:31:26.1361826Z contiguous=True, 2025-05-07T20:31:26.1362053Z compiled=True, 2025-05-07T20:31:26.1362257Z ) 2025-05-07T20:31:26.1362461Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1362839Z self=, 2025-05-07T20:31:26.1363218Z T=16384, 2025-05-07T20:31:26.1363591Z D=5120, 2025-05-07T20:31:26.1363792Z contiguous=False, 2025-05-07T20:31:26.1364016Z compiled=True, 2025-05-07T20:31:26.1364225Z ) 2025-05-07T20:31:26.1364425Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1364809Z self=, 2025-05-07T20:31:26.1365192Z T=16384, 2025-05-07T20:31:26.1365391Z D=5120, 2025-05-07T20:31:26.1365591Z contiguous=False, 2025-05-07T20:31:26.1365816Z compiled=False, 2025-05-07T20:31:26.1366021Z ) 2025-05-07T20:31:26.1366220Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1366593Z self=, 2025-05-07T20:31:26.1366978Z T=128, 2025-05-07T20:31:26.1367168Z D=7168, 2025-05-07T20:31:26.1367372Z contiguous=True, 2025-05-07T20:31:26.1367598Z compiled=False, 2025-05-07T20:31:26.1367807Z ) 2025-05-07T20:31:26.1368006Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1368376Z self=, 2025-05-07T20:31:26.1368765Z T=128, 2025-05-07T20:31:26.1368958Z D=7168, 2025-05-07T20:31:26.1369153Z contiguous=False, 2025-05-07T20:31:26.1369383Z compiled=False, 2025-05-07T20:31:26.1369691Z ) 2025-05-07T20:31:26.1369883Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1370262Z self=, 2025-05-07T20:31:26.1370647Z T=1, 2025-05-07T20:31:26.1370835Z D=5120, 2025-05-07T20:31:26.1371029Z contiguous=False, 2025-05-07T20:31:26.1371257Z compiled=False, 2025-05-07T20:31:26.1371469Z ) 2025-05-07T20:31:26.1371660Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1372033Z self=, 2025-05-07T20:31:26.1372443Z T=1, 2025-05-07T20:31:26.1372646Z D=7168, 2025-05-07T20:31:26.1372842Z contiguous=False, 2025-05-07T20:31:26.1373070Z compiled=False, 2025-05-07T20:31:26.1373278Z ) 2025-05-07T20:31:26.1373478Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1373855Z self=, 2025-05-07T20:31:26.1374231Z T=4096, 2025-05-07T20:31:26.1374420Z D=5120, 2025-05-07T20:31:26.1374621Z contiguous=True, 2025-05-07T20:31:26.1374847Z compiled=False, 2025-05-07T20:31:26.1375051Z ) 2025-05-07T20:31:26.1375248Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1375617Z self=, 2025-05-07T20:31:26.1376000Z T=128, 2025-05-07T20:31:26.1376187Z D=7168, 2025-05-07T20:31:26.1376390Z contiguous=True, 2025-05-07T20:31:26.1376612Z compiled=True, 2025-05-07T20:31:26.1376820Z ) 2025-05-07T20:31:26.1377018Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1377389Z self=, 2025-05-07T20:31:26.1377773Z T=1, 2025-05-07T20:31:26.1377959Z D=5120, 2025-05-07T20:31:26.1378151Z contiguous=False, 2025-05-07T20:31:26.1378381Z compiled=True, 2025-05-07T20:31:26.1378588Z ) 2025-05-07T20:31:26.1378785Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1379159Z self=, 2025-05-07T20:31:26.1379646Z T=4096, 2025-05-07T20:31:26.1379833Z D=7168, 2025-05-07T20:31:26.1380032Z contiguous=True, 2025-05-07T20:31:26.1380260Z compiled=False, 2025-05-07T20:31:26.1380466Z ) 2025-05-07T20:31:26.1380665Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1381046Z self=, 2025-05-07T20:31:26.1381422Z T=4096, 2025-05-07T20:31:26.1381612Z D=7168, 2025-05-07T20:31:26.1381811Z contiguous=False, 2025-05-07T20:31:26.1382034Z compiled=True, 2025-05-07T20:31:26.1382238Z ) 
2025-05-07T20:31:26.1382437Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1382840Z self=, 2025-05-07T20:31:26.1383238Z T=128, 2025-05-07T20:31:26.1383429Z D=5120, 2025-05-07T20:31:26.1383624Z contiguous=True, 2025-05-07T20:31:26.1383841Z compiled=False, 2025-05-07T20:31:26.1384049Z ) 2025-05-07T20:31:26.1384248Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1384625Z self=, 2025-05-07T20:31:26.1385010Z T=128, 2025-05-07T20:31:26.1385202Z D=5120, 2025-05-07T20:31:26.1385390Z contiguous=False, 2025-05-07T20:31:26.1385621Z compiled=False, 2025-05-07T20:31:26.1385829Z ) 2025-05-07T20:31:26.1386020Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1386395Z self=, 2025-05-07T20:31:26.1386779Z T=1, 2025-05-07T20:31:26.1386958Z D=5120, 2025-05-07T20:31:26.1387159Z contiguous=True, 2025-05-07T20:31:26.1387385Z compiled=False, 2025-05-07T20:31:26.1387582Z ) 2025-05-07T20:31:26.1387780Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1388159Z self=, 2025-05-07T20:31:26.1388537Z T=2048, 2025-05-07T20:31:26.1388728Z D=7168, 2025-05-07T20:31:26.1388943Z contiguous=False, 2025-05-07T20:31:26.1389172Z compiled=True, 2025-05-07T20:31:26.1389464Z ) 2025-05-07T20:31:26.1389669Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1390044Z self=, 2025-05-07T20:31:26.1390429Z T=2048, 2025-05-07T20:31:26.1390613Z D=7168, 2025-05-07T20:31:26.1390811Z contiguous=False, 2025-05-07T20:31:26.1391044Z compiled=False, 2025-05-07T20:31:26.1391248Z ) 2025-05-07T20:31:26.1391451Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1391830Z self=, 2025-05-07T20:31:26.1392207Z T=16384, 2025-05-07T20:31:26.1392409Z D=7168, 2025-05-07T20:31:26.1392609Z contiguous=False, 2025-05-07T20:31:26.1392835Z compiled=True, 2025-05-07T20:31:26.1393043Z ) 2025-05-07T20:31:26.1393247Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1393616Z self=, 2025-05-07T20:31:26.1394004Z T=16384, 2025-05-07T20:31:26.1394199Z D=7168, 2025-05-07T20:31:26.1394402Z contiguous=True, 2025-05-07T20:31:26.1394628Z compiled=True, 2025-05-07T20:31:26.1394834Z ) 2025-05-07T20:31:26.1395024Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1395400Z self=, 2025-05-07T20:31:26.1395782Z T=4096, 2025-05-07T20:31:26.1395972Z D=7168, 2025-05-07T20:31:26.1396176Z contiguous=True, 2025-05-07T20:31:26.1396410Z compiled=True, 2025-05-07T20:31:26.1396610Z ) 2025-05-07T20:31:26.1396813Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1397189Z self=, 2025-05-07T20:31:26.1397564Z T=2048, 2025-05-07T20:31:26.1397757Z D=5120, 2025-05-07T20:31:26.1397965Z contiguous=False, 2025-05-07T20:31:26.1398187Z compiled=False, 2025-05-07T20:31:26.1398401Z ) 2025-05-07T20:31:26.1398602Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1399068Z self=, 2025-05-07T20:31:26.1399796Z T=2048, 2025-05-07T20:31:26.1400055Z D=5120, 2025-05-07T20:31:26.1409239Z contiguous=True, 2025-05-07T20:31:26.1409483Z compiled=False, 2025-05-07T20:31:26.1409689Z ) 2025-05-07T20:31:26.1409892Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1410277Z self=, 2025-05-07T20:31:26.1410663Z T=128, 2025-05-07T20:31:26.1410844Z D=7168, 2025-05-07T20:31:26.1411045Z contiguous=False, 2025-05-07T20:31:26.1411276Z compiled=True, 2025-05-07T20:31:26.1411476Z ) 2025-05-07T20:31:26.1411678Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1412059Z self=, 2025-05-07T20:31:26.1412432Z T=16384, 2025-05-07T20:31:26.1412631Z D=5120, 2025-05-07T20:31:26.1412830Z contiguous=True, 2025-05-07T20:31:26.1413044Z compiled=True, 2025-05-07T20:31:26.1413251Z ) 2025-05-07T20:31:26.1413460Z Trying example: 
test_silu_mul( 2025-05-07T20:31:26.1413831Z self=, 2025-05-07T20:31:26.1414211Z T=2048, 2025-05-07T20:31:26.1414399Z D=5120, 2025-05-07T20:31:26.1414591Z contiguous=False, 2025-05-07T20:31:26.1414816Z compiled=True, 2025-05-07T20:31:26.1415019Z ) 2025-05-07T20:31:26.1415213Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1415585Z self=, 2025-05-07T20:31:26.1415968Z T=16384, 2025-05-07T20:31:26.1416155Z D=5120, 2025-05-07T20:31:26.1416354Z contiguous=True, 2025-05-07T20:31:26.1416588Z compiled=False, 2025-05-07T20:31:26.1416805Z ) 2025-05-07T20:31:26.1416996Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1417375Z self=, 2025-05-07T20:31:26.1417763Z T=16384, 2025-05-07T20:31:26.1417951Z D=7168, 2025-05-07T20:31:26.1418148Z contiguous=False, 2025-05-07T20:31:26.1418379Z compiled=False, 2025-05-07T20:31:26.1418704Z ) 2025-05-07T20:31:26.1418903Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1419282Z self=, 2025-05-07T20:31:26.1419652Z T=16384, 2025-05-07T20:31:26.1419844Z D=7168, 2025-05-07T20:31:26.1420043Z contiguous=True, 2025-05-07T20:31:26.1420266Z compiled=False, 2025-05-07T20:31:26.1420484Z ) 2025-05-07T20:31:26.1420678Z PASSED 2025-05-07T20:31:26.2038905Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.2039994Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:26.2041361Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.2043201Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.2044360Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:26.2045673Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.2047052Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.2048393Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:26.2049647Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.2051033Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.2052099Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:31:26.2053380Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.2054642Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:26.2055875Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.2057088Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:26.2057920Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:26.2058947Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:26.2060127Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:26.2060927Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:26.2062141Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.2063595Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.2064734Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:26.2065790Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:26.2066974Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.2068336Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.2069394Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.2070315Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.2071168Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:26.2072211Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:26.6790521Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:26.6792582Z self=,
2025-05-07T20:31:26.6793211Z T=1,
2025-05-07T20:31:26.6793421Z D=5120,
2025-05-07T20:31:26.6793620Z scale_ub=None,
2025-05-07T20:31:26.6793834Z contiguous=True,
2025-05-07T20:31:26.6794061Z compiled=True,
2025-05-07T20:31:26.6794266Z )
2025-05-07T20:31:26.6794589Z self = 
2025-05-07T20:31:26.6795075Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:26.6795335Z 
2025-05-07T20:31:26.6795416Z @given(
2025-05-07T20:31:26.6795648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:26.6795964Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:26.6796262Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:26.6796590Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:26.6796923Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:26.6797210Z )
2025-05-07T20:31:26.6797561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:26.6798005Z def test_silu_mul_quant(
2025-05-07T20:31:26.6798269Z self,
2025-05-07T20:31:26.6798469Z T: int,
2025-05-07T20:31:26.6798663Z D: int,
2025-05-07T20:31:26.6798883Z scale_ub: Optional[float],
2025-05-07T20:31:26.6799158Z contiguous: bool,
2025-05-07T20:31:26.6799391Z compiled: bool,
2025-05-07T20:31:26.6799621Z ) -> None:
2025-05-07T20:31:26.6799844Z torch.manual_seed(2025)
2025-05-07T20:31:26.6800082Z 
2025-05-07T20:31:26.6800361Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:26.6800709Z 
2025-05-07T20:31:26.6800896Z x_sign = torch.sign(x)
2025-05-07T20:31:26.6801191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:26.6801510Z x = x_sign * x_clamp
2025-05-07T20:31:26.6801914Z x0 = x[:, :D]
2025-05-07T20:31:26.6802143Z x1 = x[:, D:]
2025-05-07T20:31:26.6802358Z 
2025-05-07T20:31:26.6802541Z if contiguous:
2025-05-07T20:31:26.6802784Z x0 = x0.contiguous()
2025-05-07T20:31:26.6803044Z x1 = x1.contiguous()
2025-05-07T20:31:26.6803281Z 
2025-05-07T20:31:26.6803679Z if scale_ub is not None:
2025-05-07T20:31:26.6803956Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:26.6804295Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:26.6804603Z )
2025-05-07T20:31:26.6804799Z else:
2025-05-07T20:31:26.6805011Z scale_ub_tensor = None
2025-05-07T20:31:26.6805258Z 
2025-05-07T20:31:26.6805497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:26.6805814Z op = silu_mul_quant
2025-05-07T20:31:26.6806056Z if compiled:
2025-05-07T20:31:26.6806300Z op = torch.compile(op)
2025-05-07T20:31:26.6806604Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:26.6806874Z 
2025-05-07T20:31:26.6807069Z y_fp8, y_scale = fn()
2025-05-07T20:31:26.6807355Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:26.6807639Z 
2025-05-07T20:31:26.6807875Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:26.6808210Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:26.6808503Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:26.6808815Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:26.6809172Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:26.6809480Z 
2025-05-07T20:31:26.6809675Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:26.6809874Z 
2025-05-07T20:31:26.6809976Z moe/activation_test.py:126:
2025-05-07T20:31:26.6810272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:26.6810607Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:26.6811029Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:26.6811826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:26.6812586Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:26.6813133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:26.6813829Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:26.6814527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:26.6815265Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:26.6816019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:26.6816785Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:26.6817521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:26.6818164Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:26.6818768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:26.6819289Z fn()
2025-05-07T20:31:26.6819805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:26.6820384Z self.fn.run(
2025-05-07T20:31:26.6820858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:26.6821391Z kernel = self.compile(
2025-05-07T20:31:26.6822011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:26.6822680Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:26.6823080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:26.6823312Z 
2025-05-07T20:31:26.6823525Z self = 
2025-05-07T20:31:26.6824598Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:26.6825981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369c3060>}
2025-05-07T20:31:26.6827329Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:26.6828361Z context = 
2025-05-07T20:31:26.6828647Z 
2025-05-07T20:31:26.6828822Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:26.6829343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:26.6829814Z module_map=module_map)
2025-05-07T20:31:26.6830185Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:26.6830538Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:26.6830809Z E ^
2025-05-07T20:31:26.6831274Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:26.6831726Z 
2025-05-07T20:31:26.6832158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:26.6832759Z 
2025-05-07T20:31:26.6832863Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:26.6833280Z self=,
2025-05-07T20:31:26.6833686Z T=2048,
2025-05-07T20:31:26.6833869Z D=5120,
2025-05-07T20:31:26.6834062Z scale_ub=1200.0,
2025-05-07T20:31:26.6834288Z contiguous=True,
2025-05-07T20:31:26.6834511Z compiled=False,
2025-05-07T20:31:26.6834714Z )
2025-05-07T20:31:27.0255545Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:27.0256624Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last):
2025-05-07T20:31:27.0257982Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:27.0259437Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:27.0260414Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0261713Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:27.0263095Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.0264374Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0265615Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:27.0266996Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.0268053Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0269337Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:27.0270585Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse())
2025-05-07T20:31:27.0271803Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:27.0273016Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node)
2025-05-07T20:31:27.0273839Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0274865Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:27.0276032Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node)
2025-05-07T20:31:27.0276829Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^
2025-05-07T20:31:27.0278039Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:27.0279322Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:27.0280440Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:27.0281499Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item)
2025-05-07T20:31:27.0282681Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:27.0284204Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:27.0285262Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.0286171Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:27.0287028Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^
2025-05-07T20:31:27.0288065Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.7225644Z self = 
2025-05-07T20:31:27.7226851Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:27.7229342Z 
2025-05-07T20:31:27.7229617Z @given(
2025-05-07T20:31:27.7230027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.7230574Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.7231042Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.7231417Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.7231816Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.7232240Z )
2025-05-07T20:31:27.7232760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.7233460Z def test_silu_mul_quant(
2025-05-07T20:31:27.7233771Z self,
2025-05-07T20:31:27.7233975Z T: int,
2025-05-07T20:31:27.7234173Z D: int,
2025-05-07T20:31:27.7234499Z scale_ub: Optional[float],
2025-05-07T20:31:27.7234787Z contiguous: bool,
2025-05-07T20:31:27.7235049Z compiled: bool,
2025-05-07T20:31:27.7235374Z ) -> None:
2025-05-07T20:31:27.7235605Z torch.manual_seed(2025)
2025-05-07T20:31:27.7235851Z 
2025-05-07T20:31:27.7236140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.7236507Z 
2025-05-07T20:31:27.7236709Z x_sign = torch.sign(x)
2025-05-07T20:31:27.7237015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.7237342Z x = x_sign * x_clamp
2025-05-07T20:31:27.7237584Z x0 = x[:, :D]
2025-05-07T20:31:27.7237821Z x1 = x[:, D:]
2025-05-07T20:31:27.7238045Z 
2025-05-07T20:31:27.7238239Z if contiguous:
2025-05-07T20:31:27.7238837Z x0 = x0.contiguous()
2025-05-07T20:31:27.7239110Z x1 = x1.contiguous()
2025-05-07T20:31:27.7239362Z 
2025-05-07T20:31:27.7239570Z if scale_ub is not None:
2025-05-07T20:31:27.7239839Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.7240426Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.7240761Z )
2025-05-07T20:31:27.7240957Z else:
2025-05-07T20:31:27.7241180Z scale_ub_tensor = None
2025-05-07T20:31:27.7241446Z 
2025-05-07T20:31:27.7241689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.7242005Z op = silu_mul_quant
2025-05-07T20:31:27.7242263Z if compiled:
2025-05-07T20:31:27.7242518Z op = torch.compile(op)
2025-05-07T20:31:27.7242813Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.7243101Z 
2025-05-07T20:31:27.7243300Z > y_fp8, y_scale = fn()
2025-05-07T20:31:27.7243646Z 
2025-05-07T20:31:27.7243748Z moe/activation_test.py:117:
2025-05-07T20:31:27.7244056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7244402Z moe/activation_test.py:115: in fn
2025-05-07T20:31:27.7244685Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.7245400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:27.7246100Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:27.7246651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.7247342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.7248021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.7248571Z kernel = self.compile(
2025-05-07T20:31:27.7249126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.7249783Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.7250195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7250556Z 
2025-05-07T20:31:27.7250779Z self = 
2025-05-07T20:31:27.7251854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.7253248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369deac0>}
2025-05-07T20:31:27.7254593Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.7255629Z context = 
2025-05-07T20:31:27.7255917Z 
2025-05-07T20:31:27.7256098Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.7256627Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.7257102Z module_map=module_map)
2025-05-07T20:31:27.7257475Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.7257824Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:27.7258094Z E ^
2025-05-07T20:31:27.7258563Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.7259017Z 
2025-05-07T20:31:27.7259445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:27.7259962Z 
2025-05-07T20:31:27.7260067Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.7260486Z self=,
2025-05-07T20:31:27.7260978Z T=2048,
2025-05-07T20:31:27.7261179Z D=5120,
2025-05-07T20:31:27.7261377Z scale_ub=1200.0,
2025-05-07T20:31:27.7261613Z contiguous=True,
2025-05-07T20:31:27.7261836Z compiled=True,
2025-05-07T20:31:27.7262043Z )
2025-05-07T20:31:27.7262373Z self = 
2025-05-07T20:31:27.7262900Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:27.7263199Z 
2025-05-07T20:31:27.7263279Z @given(
2025-05-07T20:31:27.7263519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.7263842Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.7264146Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.7264483Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.7264816Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.7265111Z )
2025-05-07T20:31:27.7265467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.7265925Z def test_silu_mul_quant(
2025-05-07T20:31:27.7266176Z self,
2025-05-07T20:31:27.7266370Z T: int,
2025-05-07T20:31:27.7266576Z D: int,
2025-05-07T20:31:27.7266801Z scale_ub: Optional[float],
2025-05-07T20:31:27.7267072Z contiguous: bool,
2025-05-07T20:31:27.7267320Z compiled: bool,
2025-05-07T20:31:27.7267553Z ) -> None:
2025-05-07T20:31:27.7267771Z torch.manual_seed(2025)
2025-05-07T20:31:27.7268021Z 
2025-05-07T20:31:27.7268303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.7268646Z 
2025-05-07T20:31:27.7268854Z x_sign = torch.sign(x)
2025-05-07T20:31:27.7269155Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.7269466Z x = x_sign * x_clamp
2025-05-07T20:31:27.7269714Z x0 = x[:, :D]
2025-05-07T20:31:27.7269935Z x1 = x[:, D:]
2025-05-07T20:31:27.7270142Z 
2025-05-07T20:31:27.7270341Z if contiguous:
2025-05-07T20:31:27.7270664Z x0 = x0.contiguous()
2025-05-07T20:31:27.7270928Z x1 = x1.contiguous()
2025-05-07T20:31:27.7271166Z 
2025-05-07T20:31:27.7271363Z if scale_ub is not None:
2025-05-07T20:31:27.7271642Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.7271976Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.7272286Z )
2025-05-07T20:31:27.7272484Z else:
2025-05-07T20:31:27.7272690Z scale_ub_tensor = None
2025-05-07T20:31:27.7272971Z 
2025-05-07T20:31:27.7273240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.7273552Z op = silu_mul_quant
2025-05-07T20:31:27.7273812Z if compiled:
2025-05-07T20:31:27.7274067Z op = torch.compile(op)
2025-05-07T20:31:27.7274364Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.7274642Z 
2025-05-07T20:31:27.7274844Z y_fp8, y_scale = fn()
2025-05-07T20:31:27.7275136Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:27.7275433Z 
2025-05-07T20:31:27.7275676Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.7276011Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:27.7276304Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:27.7276630Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:27.7276991Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.7277299Z 
2025-05-07T20:31:27.7277506Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:27.7277699Z 
2025-05-07T20:31:27.7277806Z moe/activation_test.py:126:
2025-05-07T20:31:27.7278101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7278448Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:27.7278783Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.7279814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:27.7280601Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:27.7281159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.7281855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.7282560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:27.7283296Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.7284213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:27.7284977Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.7285723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:27.7286380Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:27.7286995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:27.7287531Z fn()
2025-05-07T20:31:27.7288048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:27.7288644Z self.fn.run(
2025-05-07T20:31:27.7289131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.7289669Z kernel = self.compile(
2025-05-07T20:31:27.7290265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.7291201Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.7291768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7292016Z 
2025-05-07T20:31:27.7292228Z self = 
2025-05-07T20:31:27.7293316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.7294687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f093c5387c0>}
2025-05-07T20:31:27.7296036Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.7297072Z context = 
2025-05-07T20:31:27.7297370Z 
2025-05-07T20:31:27.7297541Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.7298070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.7298544Z module_map=module_map)
2025-05-07T20:31:27.7298920Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.7299280Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:27.7299554Z E ^
2025-05-07T20:31:27.7300026Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.7300481Z 
2025-05-07T20:31:27.7300905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:27.7301429Z 
2025-05-07T20:31:27.7301536Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.7302077Z self=,
2025-05-07T20:31:27.7302485Z T=16384,
2025-05-07T20:31:27.7302681Z D=7168,
2025-05-07T20:31:27.7302881Z scale_ub=1200.0,
2025-05-07T20:31:27.7303109Z contiguous=False,
2025-05-07T20:31:27.7303334Z compiled=False,
2025-05-07T20:31:27.7303542Z )
2025-05-07T20:31:27.9702724Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:27.9703867Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
2025-05-07T20:31:27.9705227Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:27.9706678Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:27.9707660Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9708971Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:27.9710363Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.9711351Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9712858Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:27.9714243Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.9715308Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9716592Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:27.9717850Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse())
2025-05-07T20:31:27.9719067Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:27.9720276Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node)
2025-05-07T20:31:27.9721115Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9722148Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:27.9723355Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node)
2025-05-07T20:31:27.9724260Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^
2025-05-07T20:31:27.9725474Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:27.9726757Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:27.9727876Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:27.9728925Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item)
2025-05-07T20:31:27.9730112Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:27.9731473Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:27.9732535Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:27.9733500Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:27.9734240Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:27.9735267Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.0417146Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:28.0418201Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:28.0419539Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:28.0420976Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:28.0421962Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0423302Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:28.0424713Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.0425696Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0427167Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:28.0428571Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.0429642Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0430935Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:28.0432202Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:28.0433436Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:28.0434648Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:28.0435482Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0436514Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:28.0437542Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:28.0438343Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:28.0439986Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:28.0441282Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:28.0442412Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:28.0443688Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:28.0444879Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:28.0446257Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:28.0447324Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.0448243Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:28.0448983Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:28.0450132Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.4577300Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:28.4578380Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:28.4579721Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:28.4581156Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:28.4582154Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4583483Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:28.4584863Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.4585849Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4587088Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:28.4588750Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.4589814Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4591086Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:28.4592346Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:28.4593575Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:28.4594802Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:28.4595640Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4596663Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:28.4597693Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:28.4598498Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:28.4599864Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:28.4601161Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:28.4602279Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:28.4603329Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:28.4604680Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:28.4606041Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:28.4607100Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.4608023Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:28.4608774Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:28.4609796Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:29.2729158Z self = 2025-05-07T20:31:29.2729922Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:29.2730692Z 2025-05-07T20:31:29.2730807Z @given( 2025-05-07T20:31:29.2731046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.2731376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.2731687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.2732027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.2732355Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.2732646Z ) 2025-05-07T20:31:29.2733006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.2733453Z def test_silu_mul_quant( 2025-05-07T20:31:29.2733705Z self, 2025-05-07T20:31:29.2733988Z T: int, 2025-05-07T20:31:29.2742417Z D: int, 2025-05-07T20:31:29.2742658Z scale_ub: Optional[float], 2025-05-07T20:31:29.2742927Z contiguous: bool, 2025-05-07T20:31:29.2743179Z compiled: bool, 2025-05-07T20:31:29.2743449Z ) -> None: 2025-05-07T20:31:29.2743679Z torch.manual_seed(2025) 2025-05-07T20:31:29.2743931Z 2025-05-07T20:31:29.2744210Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.2744549Z 2025-05-07T20:31:29.2744746Z x_sign = torch.sign(x) 2025-05-07T20:31:29.2745041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.2745356Z x = x_sign * x_clamp 2025-05-07T20:31:29.2745591Z x0 = x[:, :D] 2025-05-07T20:31:29.2745810Z x1 = x[:, D:] 2025-05-07T20:31:29.2746020Z 2025-05-07T20:31:29.2746199Z if contiguous: 2025-05-07T20:31:29.2746431Z x0 = x0.contiguous() 2025-05-07T20:31:29.2746690Z x1 = x1.contiguous() 2025-05-07T20:31:29.2746920Z 2025-05-07T20:31:29.2747115Z if scale_ub is not None: 2025-05-07T20:31:29.2747391Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.2747724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.2748030Z ) 2025-05-07T20:31:29.2748450Z else: 2025-05-07T20:31:29.2748656Z scale_ub_tensor = None 2025-05-07T20:31:29.2748910Z 2025-05-07T20:31:29.2749145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.2749451Z op = silu_mul_quant 2025-05-07T20:31:29.2749700Z if compiled: 2025-05-07T20:31:29.2749947Z op = torch.compile(op) 2025-05-07T20:31:29.2750246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.2750514Z 2025-05-07T20:31:29.2750706Z > y_fp8, y_scale = fn() 2025-05-07T20:31:29.2750870Z 2025-05-07T20:31:29.2750977Z moe/activation_test.py:117: 2025-05-07T20:31:29.2751266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2751598Z moe/activation_test.py:115: in fn 2025-05-07T20:31:29.2751876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.2752562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:29.2753258Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:29.2753799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.2754486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.2755146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:29.2755681Z kernel = self.compile( 2025-05-07T20:31:29.2756231Z
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.2756891Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.2757284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2757518Z 2025-05-07T20:31:29.2757852Z self = 2025-05-07T20:31:29.2758940Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.2760366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369ddc60>} 2025-05-07T20:31:29.2761704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.2762723Z context = 2025-05-07T20:31:29.2763023Z 2025-05-07T20:31:29.2763192Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.2763831Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.2764306Z module_map=module_map) 2025-05-07T20:31:29.2764665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.2765016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.2765277Z E ^ 2025-05-07T20:31:29.2765734Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.2766191Z 2025-05-07T20:31:29.2766609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.2767133Z 2025-05-07T20:31:29.2767235Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.2767648Z self=, 2025-05-07T20:31:29.2768041Z T=1, 2025-05-07T20:31:29.2768227Z D=7168, 2025-05-07T20:31:29.2768422Z scale_ub=None, 2025-05-07T20:31:29.2768633Z contiguous=True, 2025-05-07T20:31:29.2768978Z compiled=True, 2025-05-07T20:31:29.2769183Z ) 2025-05-07T20:31:29.2769494Z self = 2025-05-07T20:31:29.2769977Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:29.2770235Z 2025-05-07T20:31:29.2770320Z @given( 2025-05-07T20:31:29.2770547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.2770860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.2771170Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.2771506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.2771831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.2772118Z ) 2025-05-07T20:31:29.2772461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.2772903Z def test_silu_mul_quant( 2025-05-07T20:31:29.2773140Z self, 2025-05-07T20:31:29.2773339Z T: int, 2025-05-07T20:31:29.2773535Z D: int, 2025-05-07T20:31:29.2773743Z scale_ub: Optional[float], 2025-05-07T20:31:29.2774010Z contiguous: bool, 2025-05-07T20:31:29.2774246Z compiled: bool, 2025-05-07T20:31:29.2774459Z ) -> None: 2025-05-07T20:31:29.2774674Z torch.manual_seed(2025) 2025-05-07T20:31:29.2774913Z 2025-05-07T20:31:29.2775175Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.2775516Z 2025-05-07T20:31:29.2775704Z x_sign = torch.sign(x) 2025-05-07T20:31:29.2775986Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.2776294Z x = x_sign * x_clamp 2025-05-07T20:31:29.2776531Z x0 = x[:, :D] 2025-05-07T20:31:29.2776734Z x1 = 
x[:, D:] 2025-05-07T20:31:29.2776943Z 2025-05-07T20:31:29.2777128Z if contiguous: 2025-05-07T20:31:29.2777352Z x0 = x0.contiguous() 2025-05-07T20:31:29.2777685Z x1 = x1.contiguous() 2025-05-07T20:31:29.2777927Z 2025-05-07T20:31:29.2778115Z if scale_ub is not None: 2025-05-07T20:31:29.2778378Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.2778706Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.2779010Z ) 2025-05-07T20:31:29.2779196Z else: 2025-05-07T20:31:29.2779400Z scale_ub_tensor = None 2025-05-07T20:31:29.2779650Z 2025-05-07T20:31:29.2779873Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.2780183Z op = silu_mul_quant 2025-05-07T20:31:29.2780429Z if compiled: 2025-05-07T20:31:29.2780670Z op = torch.compile(op) 2025-05-07T20:31:29.2780966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.2781237Z 2025-05-07T20:31:29.2781420Z y_fp8, y_scale = fn() 2025-05-07T20:31:29.2781700Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:29.2781993Z 2025-05-07T20:31:29.2782229Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.2782553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:29.2782840Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:29.2783151Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:29.2783502Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.2783808Z 2025-05-07T20:31:29.2784005Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:29.2784196Z 2025-05-07T20:31:29.2784293Z moe/activation_test.py:126: 2025-05-07T20:31:29.2784587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2784918Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:29.2785247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.2786033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:29.2786875Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:29.2787423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.2788097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.2788788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:29.2789512Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.2790268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:29.2791012Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.2791749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:29.2792395Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:29.2792999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:29.2793512Z fn() 2025-05-07T20:31:29.2794052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:29.2794652Z self.fn.run( 2025-05-07T20:31:29.2795112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 
2025-05-07T20:31:29.2795640Z kernel = self.compile( 2025-05-07T20:31:29.2796184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.2796839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.2797226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2797546Z 2025-05-07T20:31:29.2797754Z self = 2025-05-07T20:31:29.2798828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.2800192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09367e0360>} 2025-05-07T20:31:29.2801523Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.2802545Z context = 2025-05-07T20:31:29.2802834Z 2025-05-07T20:31:29.2803006Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.2803626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.2804086Z module_map=module_map) 2025-05-07T20:31:29.2804451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.2804810Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:29.2805076Z E ^ 2025-05-07T20:31:29.2805532Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.2806017Z 2025-05-07T20:31:29.2806558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.2807082Z 2025-05-07T20:31:29.2807192Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.2807679Z self=, 2025-05-07T20:31:29.2808089Z T=4096, 2025-05-07T20:31:29.2808375Z D=5120, 2025-05-07T20:31:29.2808565Z scale_ub=None, 2025-05-07T20:31:29.2808771Z contiguous=False, 2025-05-07T20:31:29.2808992Z compiled=False, 2025-05-07T20:31:29.2809192Z )
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:29.8756093Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:29.8757315Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:29.8758524Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:29.8759340Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:29.8760364Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:29.8761382Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:29.8762174Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:29.8763639Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:29.8764933Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:29.8766047Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:29.8767092Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:29.8768275Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:29.8769628Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:29.8770685Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.8771595Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.8772333Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:29.8773346Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:30.2407175Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:30.2408627Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:30.2409963Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:30.2411400Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:30.2412375Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2413676Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:30.2415066Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:30.2416045Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2417276Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:30.2418823Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:30.2419886Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2421167Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:30.2422416Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:30.2423641Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:30.2424902Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:30.2425728Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2426754Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:30.2427775Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:30.2428562Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:30.2429770Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:30.2431580Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:30.2432693Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:30.2433742Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:30.2434918Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:30.2436277Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:30.2437346Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:30.2438256Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:30.2439299Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:30.2440316Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:30.2549559Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:30.2550782Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:30.2552123Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:30.2553543Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:30.2554561Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2555870Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:30.2557252Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:30.2558234Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2559463Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:30.2560831Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:30.2561898Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2563303Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:30.2564682Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:30.2565900Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:30.2567106Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:30.2567945Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2568969Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:30.2569991Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:30.2570776Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:30.2571983Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:30.2573341Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:30.2574514Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:30.2575559Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:30.2576735Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:30.2578088Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:30.2579160Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:30.2580079Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:30.2580821Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:30.2581836Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0235352Z self = 2025-05-07T20:31:32.0236063Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:32.0236402Z 2025-05-07T20:31:32.0236483Z @given( 2025-05-07T20:31:32.0236755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.0237502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.0237818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.0238156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.0238770Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.0239069Z ) 2025-05-07T20:31:32.0239426Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.0239880Z def test_silu_mul_quant( 2025-05-07T20:31:32.0240122Z self, 2025-05-07T20:31:32.0240325Z T: int, 2025-05-07T20:31:32.0240528Z D: int, 2025-05-07T20:31:32.0240747Z scale_ub: Optional[float], 2025-05-07T20:31:32.0241029Z contiguous: bool, 2025-05-07T20:31:32.0241275Z compiled: bool, 2025-05-07T20:31:32.0241500Z ) -> None: 2025-05-07T20:31:32.0241726Z torch.manual_seed(2025) 2025-05-07T20:31:32.0241970Z 2025-05-07T20:31:32.0242283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.0242636Z 2025-05-07T20:31:32.0242829Z x_sign = torch.sign(x) 2025-05-07T20:31:32.0243129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.0243550Z x = x_sign * x_clamp 2025-05-07T20:31:32.0243788Z x0 = x[:, :D] 2025-05-07T20:31:32.0244009Z x1 = x[:, D:] 2025-05-07T20:31:32.0244246Z 2025-05-07T20:31:32.0244457Z if contiguous: 2025-05-07T20:31:32.0244693Z x0 = x0.contiguous() 2025-05-07T20:31:32.0244953Z x1 = x1.contiguous() 2025-05-07T20:31:32.0245190Z 2025-05-07T20:31:32.0245384Z if scale_ub is not None: 2025-05-07T20:31:32.0245665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.0245997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.0246312Z ) 2025-05-07T20:31:32.0246515Z else: 2025-05-07T20:31:32.0246730Z scale_ub_tensor = None 2025-05-07T20:31:32.0247149Z 2025-05-07T20:31:32.0247396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0247715Z op = silu_mul_quant 2025-05-07T20:31:32.0247961Z if compiled: 2025-05-07T20:31:32.0248211Z op = torch.compile(op) 2025-05-07T20:31:32.0248517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0248788Z 2025-05-07T20:31:32.0248982Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.0249145Z 2025-05-07T20:31:32.0249251Z moe/activation_test.py:117: 2025-05-07T20:31:32.0249540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0249874Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.0250155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0250851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.0251544Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.0252101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.0252794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.0253464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.0254006Z kernel = self.compile( 2025-05-07T20:31:32.0254558Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.0255228Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.0255628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0255866Z 2025-05-07T20:31:32.0256076Z self = 2025-05-07T20:31:32.0257185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.0258753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09367e1c60>} 2025-05-07T20:31:32.0260109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.0261139Z context = 2025-05-07T20:31:32.0261440Z 2025-05-07T20:31:32.0261608Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.0262141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.0262619Z module_map=module_map) 2025-05-07T20:31:32.0262988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.0263350Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.0263616Z E ^ 2025-05-07T20:31:32.0264081Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0264548Z 2025-05-07T20:31:32.0264977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.0265553Z 2025-05-07T20:31:32.0265658Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.0266189Z self=, 2025-05-07T20:31:32.0274887Z T=4096, 2025-05-07T20:31:32.0275109Z D=7168, 2025-05-07T20:31:32.0275305Z scale_ub=None, 2025-05-07T20:31:32.0275542Z contiguous=False, 2025-05-07T20:31:32.0275784Z compiled=False, 2025-05-07T20:31:32.0275996Z ) 2025-05-07T20:31:32.0276494Z self = 2025-05-07T20:31:32.0277010Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:32.0277291Z 2025-05-07T20:31:32.0277370Z @given( 2025-05-07T20:31:32.0277610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.0277932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.0278239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.0278579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.0278919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.0279211Z ) 2025-05-07T20:31:32.0279560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.0280013Z def test_silu_mul_quant( 2025-05-07T20:31:32.0280267Z self, 2025-05-07T20:31:32.0280459Z T: int, 2025-05-07T20:31:32.0280667Z D: int, 2025-05-07T20:31:32.0280903Z scale_ub: Optional[float], 2025-05-07T20:31:32.0281179Z contiguous: bool, 2025-05-07T20:31:32.0281426Z compiled: bool, 2025-05-07T20:31:32.0281658Z ) -> None: 2025-05-07T20:31:32.0281871Z torch.manual_seed(2025) 2025-05-07T20:31:32.0282122Z 2025-05-07T20:31:32.0282406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.0282748Z 2025-05-07T20:31:32.0282947Z x_sign = torch.sign(x) 2025-05-07T20:31:32.0283250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.0283677Z x = x_sign * x_clamp 2025-05-07T20:31:32.0283925Z x0 = x[:, :D] 
2025-05-07T20:31:32.0284154Z x1 = x[:, D:] 2025-05-07T20:31:32.0284375Z 2025-05-07T20:31:32.0284560Z if contiguous: 2025-05-07T20:31:32.0284799Z x0 = x0.contiguous() 2025-05-07T20:31:32.0285064Z x1 = x1.contiguous() 2025-05-07T20:31:32.0285302Z 2025-05-07T20:31:32.0285510Z if scale_ub is not None: 2025-05-07T20:31:32.0285894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.0286231Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.0286541Z ) 2025-05-07T20:31:32.0286740Z else: 2025-05-07T20:31:32.0286945Z scale_ub_tensor = None 2025-05-07T20:31:32.0287202Z 2025-05-07T20:31:32.0287438Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0287747Z op = silu_mul_quant 2025-05-07T20:31:32.0288000Z if compiled: 2025-05-07T20:31:32.0288254Z op = torch.compile(op) 2025-05-07T20:31:32.0288546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0288827Z 2025-05-07T20:31:32.0289027Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.0289194Z 2025-05-07T20:31:32.0289304Z moe/activation_test.py:117: 2025-05-07T20:31:32.0289597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0289943Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.0290242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0290935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.0291640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.0292186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.0292879Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.0293548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.0294092Z kernel = self.compile( 2025-05-07T20:31:32.0294667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.0295355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.0295856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0296097Z 2025-05-07T20:31:32.0296305Z self = 2025-05-07T20:31:32.0297391Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.0298773Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09367e2c00>} 2025-05-07T20:31:32.0300112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.0301151Z context = 2025-05-07T20:31:32.0301443Z 2025-05-07T20:31:32.0301624Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.0302154Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.0302617Z module_map=module_map) 2025-05-07T20:31:32.0302993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.0303352Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.0303608Z E ^ 2025-05-07T20:31:32.0304075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0304540Z 2025-05-07T20:31:32.0305015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.0305536Z 2025-05-07T20:31:32.0305639Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.0306055Z self=, 2025-05-07T20:31:32.0306540Z T=128, 2025-05-07T20:31:32.0306730Z D=7168, 2025-05-07T20:31:32.0306923Z scale_ub=None, 2025-05-07T20:31:32.0307134Z contiguous=False, 2025-05-07T20:31:32.0307359Z compiled=True, 2025-05-07T20:31:32.0307565Z ) 2025-05-07T20:31:32.0786340Z self = 2025-05-07T20:31:32.0787707Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:32.0788260Z 2025-05-07T20:31:32.0788415Z @given( 2025-05-07T20:31:32.0788873Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.0789488Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.0790097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.0790747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.0791396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.0791973Z ) 2025-05-07T20:31:32.0792684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.0793566Z def test_silu_mul_quant( 2025-05-07T20:31:32.0794033Z self, 2025-05-07T20:31:32.0794412Z T: int, 2025-05-07T20:31:32.0794784Z D: int, 2025-05-07T20:31:32.0794997Z scale_ub: Optional[float], 2025-05-07T20:31:32.0795273Z contiguous: bool, 2025-05-07T20:31:32.0795515Z compiled: bool, 2025-05-07T20:31:32.0795738Z ) -> None: 2025-05-07T20:31:32.0795954Z torch.manual_seed(2025) 2025-05-07T20:31:32.0796196Z 2025-05-07T20:31:32.0796466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.0796809Z 2025-05-07T20:31:32.0797005Z x_sign = torch.sign(x) 2025-05-07T20:31:32.0797296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.0797601Z x = x_sign * x_clamp 2025-05-07T20:31:32.0797840Z x0 = x[:, :D] 2025-05-07T20:31:32.0798414Z x1 = x[:, D:] 2025-05-07T20:31:32.0798624Z 2025-05-07T20:31:32.0798815Z if contiguous: 2025-05-07T20:31:32.0799047Z x0 = x0.contiguous() 2025-05-07T20:31:32.0799301Z x1 = x1.contiguous() 2025-05-07T20:31:32.0799542Z 2025-05-07T20:31:32.0799730Z if scale_ub is not None: 2025-05-07T20:31:32.0800000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.0800346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.0800654Z ) 2025-05-07T20:31:32.0800844Z else: 2025-05-07T20:31:32.0801054Z scale_ub_tensor = None 2025-05-07T20:31:32.0801307Z 2025-05-07T20:31:32.0801534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0801849Z op = silu_mul_quant 2025-05-07T20:31:32.0802097Z if compiled: 2025-05-07T20:31:32.0802343Z op = torch.compile(op) 2025-05-07T20:31:32.0802641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0802925Z 2025-05-07T20:31:32.0803120Z y_fp8, y_scale = fn() 2025-05-07T20:31:32.0803534Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:32.0803829Z 2025-05-07T20:31:32.0804075Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0804406Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:32.0804700Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:32.0805016Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:32.0805374Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:32.0805687Z 2025-05-07T20:31:32.0805892Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:32.0806084Z 2025-05-07T20:31:32.0806193Z moe/activation_test.py:126: 2025-05-07T20:31:32.0806485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0806826Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:32.0807326Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:32.0808117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:32.0808881Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:32.0809436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.0810131Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.0810822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:32.0811557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:32.0812321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:32.0813075Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:32.0813822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:32.0814469Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:32.0815074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:32.0815596Z fn() 2025-05-07T20:31:32.0816107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:32.0816695Z self.fn.run( 2025-05-07T20:31:32.0817169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.0817703Z kernel = self.compile( 2025-05-07T20:31:32.0818361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.0819036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.0819433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0819670Z 2025-05-07T20:31:32.0819878Z self = 2025-05-07T20:31:32.0820963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.0822350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f09367e3f60>} 2025-05-07T20:31:32.0823714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.0824759Z context = 2025-05-07T20:31:32.0825093Z 2025-05-07T20:31:32.0825260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.0825795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.0826266Z module_map=module_map) 2025-05-07T20:31:32.0826627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.0826984Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:32.0827251Z E ^ 2025-05-07T20:31:32.0827716Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0828179Z 2025-05-07T20:31:32.0828602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.0829134Z 2025-05-07T20:31:32.0829325Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.0829741Z self=, 2025-05-07T20:31:32.0830141Z T=128, 2025-05-07T20:31:32.0830333Z D=7168, 2025-05-07T20:31:32.0830528Z scale_ub=None, 2025-05-07T20:31:32.0830737Z contiguous=False, 2025-05-07T20:31:32.0830967Z compiled=False, 2025-05-07T20:31:32.0831180Z ) 2025-05-07T20:31:32.2356679Z self = 2025-05-07T20:31:32.2357322Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:32.2357670Z 2025-05-07T20:31:32.2357753Z @given( 2025-05-07T20:31:32.2357987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.2358297Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.2358606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.2358962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.2359302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.2359591Z ) 2025-05-07T20:31:32.2359945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.2360386Z def test_silu_mul_quant( 2025-05-07T20:31:32.2360638Z self, 2025-05-07T20:31:32.2360835Z T: int, 2025-05-07T20:31:32.2361033Z D: int, 2025-05-07T20:31:32.2361243Z scale_ub: Optional[float], 2025-05-07T20:31:32.2361514Z contiguous: bool, 2025-05-07T20:31:32.2361750Z compiled: bool, 2025-05-07T20:31:32.2361979Z ) -> None: 2025-05-07T20:31:32.2362198Z torch.manual_seed(2025) 2025-05-07T20:31:32.2362442Z 2025-05-07T20:31:32.2362717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.2363062Z 2025-05-07T20:31:32.2363255Z x_sign = torch.sign(x) 2025-05-07T20:31:32.2363700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.2364371Z x = x_sign * x_clamp 2025-05-07T20:31:32.2364623Z x0 = x[:, :D] 2025-05-07T20:31:32.2364850Z x1 = x[:, D:] 2025-05-07T20:31:32.2365062Z 2025-05-07T20:31:32.2365248Z if contiguous: 2025-05-07T20:31:32.2365486Z x0 = x0.contiguous() 2025-05-07T20:31:32.2365745Z x1 = x1.contiguous() 2025-05-07T20:31:32.2365982Z 2025-05-07T20:31:32.2366180Z if scale_ub is not None: 2025-05-07T20:31:32.2366458Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.2366793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.2367111Z ) 2025-05-07T20:31:32.2367308Z else: 2025-05-07T20:31:32.2367517Z scale_ub_tensor = None 2025-05-07T20:31:32.2367771Z 2025-05-07T20:31:32.2368007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.2368322Z op = silu_mul_quant 2025-05-07T20:31:32.2368579Z if compiled: 
2025-05-07T20:31:32.2368838Z op = torch.compile(op) 2025-05-07T20:31:32.2369135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2369412Z 2025-05-07T20:31:32.2369609Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.2369781Z 2025-05-07T20:31:32.2369887Z moe/activation_test.py:117: 2025-05-07T20:31:32.2370288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2370640Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.2370925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2371620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.2372323Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.2372866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.2373562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.2374444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.2374989Z kernel = self.compile( 2025-05-07T20:31:32.2375539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.2376209Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.2376605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2376842Z 2025-05-07T20:31:32.2377050Z self = 2025-05-07T20:31:32.2378135Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.2379526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090e923ec0>} 2025-05-07T20:31:32.2380874Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.2381909Z context = 2025-05-07T20:31:32.2382203Z 2025-05-07T20:31:32.2382374Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.2382896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.2383368Z module_map=module_map) 2025-05-07T20:31:32.2383736Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.2384098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.2384361Z E ^ 2025-05-07T20:31:32.2384924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.2385430Z 2025-05-07T20:31:32.2385862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.2386379Z 2025-05-07T20:31:32.2386483Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.2386900Z self=, 2025-05-07T20:31:32.2387303Z T=4096, 2025-05-07T20:31:32.2387495Z D=5120, 2025-05-07T20:31:32.2387685Z scale_ub=1200.0, 2025-05-07T20:31:32.2387910Z contiguous=True, 2025-05-07T20:31:32.2388134Z compiled=False, 2025-05-07T20:31:32.2388337Z ) 2025-05-07T20:31:32.2388659Z self = 2025-05-07T20:31:32.2389160Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:32.2389441Z 2025-05-07T20:31:32.2389521Z @given( 2025-05-07T20:31:32.2389751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.2390063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.2390364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.2390695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.2391022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.2391312Z ) 2025-05-07T20:31:32.2391656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.2392098Z def test_silu_mul_quant( 2025-05-07T20:31:32.2392341Z self, 2025-05-07T20:31:32.2392529Z T: int, 2025-05-07T20:31:32.2392726Z D: int, 2025-05-07T20:31:32.2392944Z scale_ub: Optional[float], 2025-05-07T20:31:32.2393208Z contiguous: bool, 2025-05-07T20:31:32.2393449Z compiled: bool, 2025-05-07T20:31:32.2393673Z ) -> None: 2025-05-07T20:31:32.2393895Z torch.manual_seed(2025) 2025-05-07T20:31:32.2394226Z 2025-05-07T20:31:32.2394496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.2394854Z 2025-05-07T20:31:32.2395084Z x_sign = torch.sign(x) 2025-05-07T20:31:32.2395380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.2395691Z x = x_sign * x_clamp 2025-05-07T20:31:32.2395931Z x0 = x[:, :D] 2025-05-07T20:31:32.2396151Z x1 = x[:, D:] 2025-05-07T20:31:32.2396364Z 2025-05-07T20:31:32.2396544Z if contiguous: 2025-05-07T20:31:32.2396780Z x0 = x0.contiguous() 2025-05-07T20:31:32.2397038Z x1 = x1.contiguous() 2025-05-07T20:31:32.2397272Z 2025-05-07T20:31:32.2397467Z if scale_ub is not None: 2025-05-07T20:31:32.2397739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.2398069Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.2398380Z ) 2025-05-07T20:31:32.2398589Z else: 2025-05-07T20:31:32.2398797Z scale_ub_tensor = None 2025-05-07T20:31:32.2399056Z 2025-05-07T20:31:32.2399294Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.2399600Z op = silu_mul_quant 2025-05-07T20:31:32.2399852Z if compiled: 2025-05-07T20:31:32.2400097Z op = torch.compile(op) 2025-05-07T20:31:32.2400387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2400664Z 2025-05-07T20:31:32.2400858Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.2401022Z 2025-05-07T20:31:32.2401124Z moe/activation_test.py:117: 2025-05-07T20:31:32.2401410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2401744Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.2402025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2402798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.2403646Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.2404192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.2404884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.2405550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.2406090Z kernel = self.compile( 2025-05-07T20:31:32.2406639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.2407298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.2407695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2407931Z 2025-05-07T20:31:32.2408145Z self = 2025-05-07T20:31:32.2409233Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.2410606Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090e720720>} 2025-05-07T20:31:32.2411948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.2412980Z context = 2025-05-07T20:31:32.2413275Z 2025-05-07T20:31:32.2413444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.2413977Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.2414527Z module_map=module_map) 2025-05-07T20:31:32.2414892Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.2415251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.2415508Z E ^ 2025-05-07T20:31:32.2415977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.2416438Z 2025-05-07T20:31:32.2416861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.2417382Z 2025-05-07T20:31:32.2417495Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.2417904Z self=, 2025-05-07T20:31:32.2418307Z T=1, 2025-05-07T20:31:32.2418499Z D=5120, 2025-05-07T20:31:32.2418685Z scale_ub=None, 2025-05-07T20:31:32.2418906Z contiguous=True, 2025-05-07T20:31:32.2419139Z compiled=True, 2025-05-07T20:31:32.2419338Z ) 2025-05-07T20:31:32.5657986Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.5659050Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:32.5660397Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.5661845Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.5663197Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5664577Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.5665989Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.5666970Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5668210Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.5669602Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.5670669Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5671948Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.5673198Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:32.5674431Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.5675812Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:32.5676646Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5677675Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.5678693Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:32.5688062Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.5689313Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.5690619Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.5691749Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.5692806Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:32.5694103Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.5695472Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.5696546Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.5697472Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.5698214Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:32.5699244Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.1258632Z self = 2025-05-07T20:31:33.1259281Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:33.1259557Z 2025-05-07T20:31:33.1259640Z @given( 2025-05-07T20:31:33.1259892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:33.1260213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:33.1260544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:33.1260893Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:33.1261224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:33.1261503Z ) 2025-05-07T20:31:33.1261857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:33.1262300Z def test_silu_mul_quant( 2025-05-07T20:31:33.1262535Z self, 2025-05-07T20:31:33.1262733Z T: int, 2025-05-07T20:31:33.1262935Z D: int, 2025-05-07T20:31:33.1263146Z scale_ub: Optional[float], 2025-05-07T20:31:33.1263421Z contiguous: bool, 2025-05-07T20:31:33.1263664Z compiled: bool, 2025-05-07T20:31:33.1263890Z ) -> None: 2025-05-07T20:31:33.1264109Z torch.manual_seed(2025) 2025-05-07T20:31:33.1264359Z 2025-05-07T20:31:33.1264635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:33.1264992Z 2025-05-07T20:31:33.1265562Z x_sign = torch.sign(x) 2025-05-07T20:31:33.1265865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:33.1266173Z x = x_sign * x_clamp 2025-05-07T20:31:33.1266414Z x0 = x[:, :D] 2025-05-07T20:31:33.1266637Z x1 = x[:, D:] 2025-05-07T20:31:33.1266849Z 2025-05-07T20:31:33.1267041Z if contiguous: 2025-05-07T20:31:33.1267277Z x0 = x0.contiguous() 2025-05-07T20:31:33.1267530Z x1 = x1.contiguous() 2025-05-07T20:31:33.1267776Z 2025-05-07T20:31:33.1267968Z if scale_ub is not None: 2025-05-07T20:31:33.1268237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:33.1268576Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:33.1268886Z ) 2025-05-07T20:31:33.1269075Z else: 2025-05-07T20:31:33.1269287Z scale_ub_tensor = None 2025-05-07T20:31:33.1269538Z 2025-05-07T20:31:33.1269766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.1270082Z op = silu_mul_quant 2025-05-07T20:31:33.1270333Z if compiled: 2025-05-07T20:31:33.1270581Z op = torch.compile(op) 2025-05-07T20:31:33.1270875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.1271150Z 2025-05-07T20:31:33.1271344Z y_fp8, y_scale = fn() 2025-05-07T20:31:33.1271625Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:33.1271914Z 2025-05-07T20:31:33.1272151Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.1272481Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:33.1272782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:33.1273100Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:33.1273454Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.1273769Z 2025-05-07T20:31:33.1273971Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:33.1274166Z 2025-05-07T20:31:33.1274281Z moe/activation_test.py:126: 2025-05-07T20:31:33.1274741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.1275084Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:33.1275415Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.1276204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:33.1276961Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:33.1277510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:33.1278204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:33.1278892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:33.1279622Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.1280388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:33.1281145Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.1281874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:33.1282521Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:33.1283131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:33.1283818Z fn() 2025-05-07T20:31:33.1284332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:33.1284916Z self.fn.run( 2025-05-07T20:31:33.1285502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:33.1286039Z kernel = self.compile( 2025-05-07T20:31:33.1286587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:33.1287250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.1287643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.1287880Z 2025-05-07T20:31:33.1288090Z self = 2025-05-07T20:31:33.1289169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:33.1290558Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924366660>} 2025-05-07T20:31:33.1291908Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:33.1292933Z context = 2025-05-07T20:31:33.1293226Z 2025-05-07T20:31:33.1293394Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:33.1293920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.1294388Z module_map=module_map) 2025-05-07T20:31:33.1294747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.1295118Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:33.1295389Z E ^ 2025-05-07T20:31:33.1295854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.1296450Z 2025-05-07T20:31:33.1296872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:33.1297395Z 2025-05-07T20:31:33.1297499Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:33.1297918Z self=, 2025-05-07T20:31:33.1298314Z T=2048, 2025-05-07T20:31:33.1298502Z D=5120, 2025-05-07T20:31:33.1298699Z scale_ub=None, 2025-05-07T20:31:33.1298911Z contiguous=True, 2025-05-07T20:31:33.1299137Z compiled=True, 2025-05-07T20:31:33.1299347Z ) 2025-05-07T20:31:33.4458250Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:33.4459385Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:33.4460766Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:33.4462223Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:33.4463210Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4464530Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:33.4466285Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.4467288Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4468651Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:33.4470058Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.4471141Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4472427Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:33.4473693Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:33.4474924Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:33.4476366Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:33.4477274Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4478540Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:33.4479898Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:33.4480708Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:33.4481930Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:33.4483229Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:33.4484557Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:33.4485713Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:33.4486907Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:33.4488264Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:33.4489319Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.4490420Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:33.4491163Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:33.4492179Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.5329717Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:33.5330818Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:33.5332199Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:33.5333673Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:33.5334673Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5336064Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:33.5337483Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.5339222Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5340460Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:33.5341855Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.5342934Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5344224Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:33.5345484Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:33.5346702Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:33.5347915Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:33.5348744Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5349936Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:33.5350971Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:33.5351759Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:33.5352972Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:33.5354262Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:33.5355387Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:33.5356440Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:33.5357619Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:33.5358975Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:33.5360036Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.5360952Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:33.5361814Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:33.5362840Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.1948708Z self = 2025-05-07T20:31:34.1949362Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:34.1949642Z 2025-05-07T20:31:34.1949731Z @given( 2025-05-07T20:31:34.1949967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:34.1950293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:34.1950606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:34.1950937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:34.1951272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:34.1951567Z ) 2025-05-07T20:31:34.1951931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:34.1952374Z def test_silu_mul_quant( 2025-05-07T20:31:34.1952620Z self, 2025-05-07T20:31:34.1952843Z T: int, 2025-05-07T20:31:34.1953050Z D: int, 2025-05-07T20:31:34.1953270Z scale_ub: Optional[float], 2025-05-07T20:31:34.1953544Z contiguous: bool, 2025-05-07T20:31:34.1953782Z compiled: bool, 2025-05-07T20:31:34.1954029Z ) -> None: 2025-05-07T20:31:34.1954252Z torch.manual_seed(2025) 2025-05-07T20:31:34.1954497Z 2025-05-07T20:31:34.1954774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:34.1955115Z 2025-05-07T20:31:34.1955318Z x_sign = torch.sign(x) 2025-05-07T20:31:34.1955615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:34.1955923Z x = x_sign * x_clamp 2025-05-07T20:31:34.1956169Z x0 = x[:, :D] 2025-05-07T20:31:34.1956392Z x1 = x[:, D:] 2025-05-07T20:31:34.1956599Z 2025-05-07T20:31:34.1956794Z if contiguous: 2025-05-07T20:31:34.1957032Z x0 = x0.contiguous() 2025-05-07T20:31:34.1957287Z x1 = x1.contiguous() 2025-05-07T20:31:34.1958002Z 2025-05-07T20:31:34.1958418Z if scale_ub is not None: 2025-05-07T20:31:34.1966964Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:34.1967327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:34.1967646Z ) 2025-05-07T20:31:34.1967843Z else: 2025-05-07T20:31:34.1968064Z scale_ub_tensor = None 2025-05-07T20:31:34.1968327Z 2025-05-07T20:31:34.1968562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:34.1968889Z op = silu_mul_quant 2025-05-07T20:31:34.1969144Z if compiled: 2025-05-07T20:31:34.1969390Z op = torch.compile(op) 2025-05-07T20:31:34.1969693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:34.1969977Z 2025-05-07T20:31:34.1970169Z y_fp8, y_scale = fn() 2025-05-07T20:31:34.1970466Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:34.1970765Z 2025-05-07T20:31:34.1971022Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:34.1971369Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:34.1971666Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:34.1971987Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:34.1972345Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:34.1972662Z 2025-05-07T20:31:34.1972870Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:34.1973068Z 2025-05-07T20:31:34.1973171Z moe/activation_test.py:126: 2025-05-07T20:31:34.1973476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:34.1973818Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:34.1974150Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:34.1974937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:34.1975999Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:34.1976550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:34.1977230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:34.1977928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:34.1978659Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:34.1979419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:34.1980164Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:34.1980896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:34.1981545Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:34.1982162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:34.1982676Z fn() 2025-05-07T20:31:34.1983188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:34.1983772Z self.fn.run( 2025-05-07T20:31:34.1984237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:34.1984769Z kernel = self.compile( 2025-05-07T20:31:34.1985314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:34.1985971Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.1986368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:34.1986602Z 2025-05-07T20:31:34.1986904Z self = 2025-05-07T20:31:34.1987987Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:34.1989366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f092455a5c0>} 2025-05-07T20:31:34.1990692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:34.1991720Z context = 2025-05-07T20:31:34.1992013Z 2025-05-07T20:31:34.1992188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:34.1992723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.1993186Z module_map=module_map) 2025-05-07T20:31:34.1993555Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.1993920Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:34.1994193Z E ^ 2025-05-07T20:31:34.1994658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.1995121Z 2025-05-07T20:31:34.1995540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:34.1996057Z 2025-05-07T20:31:34.1996172Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:34.1996579Z self=, 2025-05-07T20:31:34.1997085Z T=128, 2025-05-07T20:31:34.1997367Z D=5120, 2025-05-07T20:31:34.1997662Z scale_ub=None, 2025-05-07T20:31:34.1997875Z contiguous=True, 2025-05-07T20:31:34.1998102Z compiled=True, 2025-05-07T20:31:34.1998313Z ) 2025-05-07T20:31:34.5262495Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.5263593Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:34.5264943Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.5266467Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.5267455Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5268765Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.5270155Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.5271145Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5272759Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.5274154Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.5275215Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5276493Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.5277749Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:34.5278970Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.5280178Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:34.5281004Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5282027Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.5283041Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:34.5283970Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.5285352Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.5286636Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.5287749Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.5288784Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:34.5289964Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.5291328Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.5292389Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.5293304Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.5294038Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:34.5295132Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.1330424Z self = 2025-05-07T20:31:35.1331114Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:35.1331393Z 2025-05-07T20:31:35.1331485Z @given( 2025-05-07T20:31:35.1331721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.1332071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.1332766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.1333098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.1333434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.1333729Z ) 2025-05-07T20:31:35.1334086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.1334535Z def test_silu_mul_quant( 2025-05-07T20:31:35.1334791Z self, 2025-05-07T20:31:35.1334997Z T: int, 2025-05-07T20:31:35.1335189Z D: int, 2025-05-07T20:31:35.1335415Z scale_ub: Optional[float], 2025-05-07T20:31:35.1335695Z contiguous: bool, 2025-05-07T20:31:35.1335939Z compiled: bool, 2025-05-07T20:31:35.1336167Z ) -> None: 2025-05-07T20:31:35.1336388Z torch.manual_seed(2025) 2025-05-07T20:31:35.1336625Z 2025-05-07T20:31:35.1336918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.1337285Z 2025-05-07T20:31:35.1337475Z x_sign = torch.sign(x) 2025-05-07T20:31:35.1337773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.1338086Z x = x_sign * x_clamp 2025-05-07T20:31:35.1338320Z x0 = x[:, :D] 2025-05-07T20:31:35.1338902Z x1 = x[:, D:] 2025-05-07T20:31:35.1339117Z 2025-05-07T20:31:35.1339298Z if contiguous: 2025-05-07T20:31:35.1339533Z x0 = x0.contiguous() 2025-05-07T20:31:35.1339791Z x1 = x1.contiguous() 2025-05-07T20:31:35.1340032Z 2025-05-07T20:31:35.1340218Z if scale_ub is not None: 2025-05-07T20:31:35.1340491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.1340829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.1341134Z ) 2025-05-07T20:31:35.1341329Z else: 2025-05-07T20:31:35.1341539Z scale_ub_tensor = None 2025-05-07T20:31:35.1341781Z 2025-05-07T20:31:35.1342185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.1342515Z op = silu_mul_quant 2025-05-07T20:31:35.1342760Z if compiled: 2025-05-07T20:31:35.1343012Z op = torch.compile(op) 2025-05-07T20:31:35.1343315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.1343585Z 2025-05-07T20:31:35.1343783Z y_fp8, y_scale = fn() 2025-05-07T20:31:35.1344074Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:35.1344355Z 2025-05-07T20:31:35.1344593Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.1344947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:35.1345245Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:35.1345729Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:35.1346096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.1346404Z 2025-05-07T20:31:35.1346613Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:35.1346812Z 2025-05-07T20:31:35.1346923Z moe/activation_test.py:126: 2025-05-07T20:31:35.1347217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.1347561Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:35.1347891Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.1348681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:35.1349446Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:35.1349997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.1350682Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.1351368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:35.1352239Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.1353000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:35.1353755Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.1354484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:35.1355128Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:35.1355735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:35.1356264Z fn() 2025-05-07T20:31:35.1356770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:35.1357355Z self.fn.run( 2025-05-07T20:31:35.1357833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.1358368Z kernel = self.compile( 2025-05-07T20:31:35.1358919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.1359583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.1359981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.1360211Z 2025-05-07T20:31:35.1360422Z self = 2025-05-07T20:31:35.1361504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.1363009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924558400>} 2025-05-07T20:31:35.1364533Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.1365561Z context = 2025-05-07T20:31:35.1365856Z 2025-05-07T20:31:35.1366025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.1366549Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.1367024Z module_map=module_map) 2025-05-07T20:31:35.1367389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.1367755Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:35.1368030Z E ^ 2025-05-07T20:31:35.1368501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.1368968Z 2025-05-07T20:31:35.1369387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.1369914Z 2025-05-07T20:31:35.1370018Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.1370438Z self=, 2025-05-07T20:31:35.1370838Z T=4096, 2025-05-07T20:31:35.1371039Z D=5120, 2025-05-07T20:31:35.1371233Z scale_ub=None, 2025-05-07T20:31:35.1371447Z contiguous=True, 2025-05-07T20:31:35.1371679Z compiled=True, 2025-05-07T20:31:35.1371891Z ) 2025-05-07T20:31:35.4684509Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:35.4686364Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:35.4689516Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:35.4692375Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:35.4694309Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4696306Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:35.4697696Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.4698675Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4699903Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:35.4701276Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.4702484Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4703775Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:35.4705027Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:35.4706253Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:35.4707455Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:35.4708290Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4709319Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:35.4710344Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:35.4711137Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:35.4712338Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:35.4713626Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:35.4714824Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:35.4715865Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:35.4717036Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:35.4718387Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:35.4719446Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.4720365Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.4721103Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:35.4722112Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.5564526Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:35.5566096Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:35.5567837Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:35.5569300Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:35.5570268Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5571573Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:35.5572958Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.5573940Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5575166Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:35.5576553Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.5577609Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5578888Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:35.5580289Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:35.5581510Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:35.5582724Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:35.5583545Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5584576Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:35.5585599Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:35.5586394Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:35.5587600Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:35.5588889Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:35.5590085Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:35.5591138Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:35.5592321Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:35.5593677Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:35.5594736Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.5595657Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.5596457Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:35.5597494Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090db2e7a0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
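For context on what the reference path above computes: ref_fn applies the SiLU gate x0 * sigmoid(x0) * x1 and then quantizes each row to fp8 via triton_quantize_fp8_row. The eager-mode sketch below illustrates plausible rowwise semantics (per-row absmax scaling against the fp8 maximum); the scaling scheme is an assumption for illustration, not the actual FBGEMM kernel, and quantize_fp8_row_ref is a hypothetical name:

from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn        # "fp8e4nv" in Triton's naming
FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0 for e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clamped to the scale upper bound.
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # Dequantization scale per row; quantize with its inverse.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(FP8_DTYPE)
    return y_fp8, scale

# This matches how the test consumes the outputs:
#   y ~= y_fp8.to(torch.float32) * scale[:, None]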
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
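The recompile-limit warning is independent of the compilation failures: torch.compile guards on input strides, and this test alternates between sliced views of the [T, 2*D] buffer (row stride 2*D) and .contiguous() copies (row stride D), so the guard keeps failing until the default limit of 8 recompiles is exhausted. A small illustration of the stride mismatch quoted above; the mark_dynamic line at the end is one possible mitigation, not something the test does:

import torch

# A [T, 2*D] buffer sliced in half: the view keeps the parent's row stride.
x = torch.randn(128, 2 * 5120)
x0 = x[:, :5120]
print(x0.stride())               # (10240, 1) -> the "actual 10240" in the guard
print(x0.contiguous().stride())  # (5120, 1)  -> the "expected 5120"

# Each new (T, D, contiguous) combination triggers another compile until
# torch._dynamo's recompile_limit (default 8) is hit. A hypothetical way to
# reduce recompiles is to mark the batch dimension dynamic:
# torch._dynamo.mark_dynamic(x0, 0)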
2025-05-07T20:31:36.1734674Z ) 2025-05-07T20:31:36.1734868Z else: 2025-05-07T20:31:36.1735084Z scale_ub_tensor = None 2025-05-07T20:31:36.1735331Z 2025-05-07T20:31:36.1735567Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.1735880Z op = silu_mul_quant 2025-05-07T20:31:36.1736122Z if compiled: 2025-05-07T20:31:36.1736370Z op = torch.compile(op) 2025-05-07T20:31:36.1736671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.1736950Z 2025-05-07T20:31:36.1737151Z y_fp8, y_scale = fn() 2025-05-07T20:31:36.1737441Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:36.1737734Z 2025-05-07T20:31:36.1737967Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.1738301Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:36.1738884Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:36.1739194Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:36.1739556Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.1739872Z 2025-05-07T20:31:36.1740069Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:36.1740271Z 2025-05-07T20:31:36.1740372Z moe/activation_test.py:126: 2025-05-07T20:31:36.1740667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.1741003Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:36.1741335Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.1742294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:36.1743053Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:36.1743598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.1744294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.1744998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:36.1745728Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.1746484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:36.1747252Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.1747996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:36.1748646Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:36.1749246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:36.1749774Z fn() 2025-05-07T20:31:36.1750290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:36.1750869Z self.fn.run( 2025-05-07T20:31:36.1751342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.1751885Z kernel = self.compile( 2025-05-07T20:31:36.1752432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.1753211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.1753623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:36.1753853Z 2025-05-07T20:31:36.1754073Z self = 2025-05-07T20:31:36.1755148Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.1756574Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d9de200>} 2025-05-07T20:31:36.1757910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.1758951Z context = 2025-05-07T20:31:36.1759240Z 2025-05-07T20:31:36.1759416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.1759936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.1760407Z module_map=module_map) 2025-05-07T20:31:36.1760776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.1761137Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:36.1761398Z E ^ 2025-05-07T20:31:36.1761867Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.1762323Z 2025-05-07T20:31:36.1762753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.1763269Z 2025-05-07T20:31:36.1763456Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.1764000Z self=, 2025-05-07T20:31:36.1764404Z T=1, 2025-05-07T20:31:36.1764587Z D=5120, 2025-05-07T20:31:36.1764784Z scale_ub=1200.0, 2025-05-07T20:31:36.1765013Z contiguous=True, 2025-05-07T20:31:36.1765232Z compiled=True, 2025-05-07T20:31:36.1765445Z ) 2025-05-07T20:31:36.4914223Z self = 2025-05-07T20:31:36.4915665Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:36.4916271Z 2025-05-07T20:31:36.4916400Z @given( 2025-05-07T20:31:36.4916666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.4916985Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.4917294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.4917629Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.4917982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.4918283Z ) 2025-05-07T20:31:36.4918643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.4919090Z def test_silu_mul_quant( 2025-05-07T20:31:36.4919335Z self, 2025-05-07T20:31:36.4919532Z T: int, 2025-05-07T20:31:36.4919736Z D: int, 2025-05-07T20:31:36.4919963Z scale_ub: Optional[float], 2025-05-07T20:31:36.4920235Z contiguous: bool, 2025-05-07T20:31:36.4920479Z compiled: bool, 2025-05-07T20:31:36.4920714Z ) -> None: 2025-05-07T20:31:36.4920926Z torch.manual_seed(2025) 2025-05-07T20:31:36.4921175Z 2025-05-07T20:31:36.4921457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.4921797Z 2025-05-07T20:31:36.4921997Z x_sign = torch.sign(x) 2025-05-07T20:31:36.4922296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.4922608Z x = x_sign * x_clamp 2025-05-07T20:31:36.4923181Z x0 = x[:, :D] 2025-05-07T20:31:36.4923538Z x1 = x[:, D:] 2025-05-07T20:31:36.4923747Z 2025-05-07T20:31:36.4923942Z if contiguous: 2025-05-07T20:31:36.4924181Z x0 = x0.contiguous() 2025-05-07T20:31:36.4924440Z x1 = x1.contiguous() 2025-05-07T20:31:36.4924685Z 2025-05-07T20:31:36.4924883Z if scale_ub is not None: 2025-05-07T20:31:36.4925158Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:36.4925494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.4925808Z ) 2025-05-07T20:31:36.4926004Z else: 2025-05-07T20:31:36.4926215Z scale_ub_tensor = None 2025-05-07T20:31:36.4926470Z 2025-05-07T20:31:36.4926708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.4927019Z op = silu_mul_quant 2025-05-07T20:31:36.4927272Z if compiled: 2025-05-07T20:31:36.4927519Z op = torch.compile(op) 2025-05-07T20:31:36.4927831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4928110Z 2025-05-07T20:31:36.4928308Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.4928473Z 2025-05-07T20:31:36.4928575Z moe/activation_test.py:117: 2025-05-07T20:31:36.4928877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4929217Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.4929509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4930071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:36.4930642Z return fn(*args, **kwargs) 2025-05-07T20:31:36.4931312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.4932001Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.4932558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.4933480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.4934152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.4934687Z kernel = self.compile( 2025-05-07T20:31:36.4935236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.4935904Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.4936310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4936545Z 2025-05-07T20:31:36.4936754Z self = 2025-05-07T20:31:36.4937844Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.4939520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d451d00>} 2025-05-07T20:31:36.4940873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.4941901Z context = 2025-05-07T20:31:36.4942196Z 2025-05-07T20:31:36.4942366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.4942896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.4943368Z module_map=module_map) 2025-05-07T20:31:36.4943862Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.4944231Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.4944495Z E ^ 2025-05-07T20:31:36.4944962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.4945425Z 2025-05-07T20:31:36.4945847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.4946413Z 2025-05-07T20:31:36.4946532Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.4946953Z self=, 2025-05-07T20:31:36.4947355Z T=1, 2025-05-07T20:31:36.4947546Z D=5120, 2025-05-07T20:31:36.4947748Z scale_ub=None, 2025-05-07T20:31:36.4947961Z contiguous=False, 2025-05-07T20:31:36.4948191Z compiled=True, 2025-05-07T20:31:36.4948403Z ) 2025-05-07T20:31:36.5450303Z self = 2025-05-07T20:31:36.5451077Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:36.5451463Z 2025-05-07T20:31:36.5451578Z @given( 2025-05-07T20:31:36.5451904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.5452339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.5452740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.5453082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.5453419Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.5453702Z ) 2025-05-07T20:31:36.5454055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.5454505Z def test_silu_mul_quant( 2025-05-07T20:31:36.5454744Z self, 2025-05-07T20:31:36.5454946Z T: int, 2025-05-07T20:31:36.5455168Z D: int, 2025-05-07T20:31:36.5455385Z scale_ub: Optional[float], 2025-05-07T20:31:36.5455670Z contiguous: bool, 2025-05-07T20:31:36.5456107Z compiled: bool, 2025-05-07T20:31:36.5456333Z ) -> None: 2025-05-07T20:31:36.5456558Z torch.manual_seed(2025) 2025-05-07T20:31:36.5456808Z 2025-05-07T20:31:36.5457086Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.5457437Z 2025-05-07T20:31:36.5457643Z x_sign = torch.sign(x) 2025-05-07T20:31:36.5457944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.5458256Z x = x_sign * x_clamp 2025-05-07T20:31:36.5458506Z x0 = x[:, :D] 2025-05-07T20:31:36.5458729Z x1 = x[:, D:] 2025-05-07T20:31:36.5458940Z 2025-05-07T20:31:36.5469223Z if contiguous: 2025-05-07T20:31:36.5469522Z x0 = x0.contiguous() 2025-05-07T20:31:36.5469786Z x1 = x1.contiguous() 2025-05-07T20:31:36.5470032Z 2025-05-07T20:31:36.5470241Z if scale_ub is not None: 2025-05-07T20:31:36.5470515Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.5470881Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.5471200Z ) 2025-05-07T20:31:36.5471412Z else: 2025-05-07T20:31:36.5471621Z scale_ub_tensor = None 2025-05-07T20:31:36.5471881Z 2025-05-07T20:31:36.5472130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.5472448Z op = silu_mul_quant 2025-05-07T20:31:36.5472714Z if compiled: 2025-05-07T20:31:36.5472969Z op = torch.compile(op) 2025-05-07T20:31:36.5473267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.5473551Z 2025-05-07T20:31:36.5473750Z y_fp8, y_scale = fn() 2025-05-07T20:31:36.5474036Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:36.5474336Z 2025-05-07T20:31:36.5474583Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.5474919Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:36.5475380Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:36.5475711Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:36.5476076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.5476386Z 2025-05-07T20:31:36.5476596Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:36.5476790Z 2025-05-07T20:31:36.5476903Z moe/activation_test.py:126: 2025-05-07T20:31:36.5477203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.5477547Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:36.5477884Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.5478677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:36.5479448Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:36.5480015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.5480716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.5481411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:36.5482142Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.5482906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:36.5483773Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.5484506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:36.5485156Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:36.5485772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:36.5486418Z fn() 2025-05-07T20:31:36.5486950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:36.5487539Z self.fn.run( 2025-05-07T20:31:36.5488018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.5488596Z kernel = self.compile( 2025-05-07T20:31:36.5489149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.5489821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.5490222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.5490468Z 2025-05-07T20:31:36.5490678Z self = 2025-05-07T20:31:36.5491772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.5493151Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f090d451260>} 2025-05-07T20:31:36.5494506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.5495540Z context = 2025-05-07T20:31:36.5495823Z 2025-05-07T20:31:36.5495987Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.5496516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.5497069Z module_map=module_map) 2025-05-07T20:31:36.5497440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.5497805Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:36.5498082Z E ^ 2025-05-07T20:31:36.5498554Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.5499010Z 2025-05-07T20:31:36.5499433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.5499960Z 2025-05-07T20:31:36.5500064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.5500486Z self=, 2025-05-07T20:31:36.5500898Z T=1, 2025-05-07T20:31:36.5501080Z D=5120, 2025-05-07T20:31:36.5501280Z scale_ub=None, 2025-05-07T20:31:36.5501502Z contiguous=True, 2025-05-07T20:31:36.5501727Z compiled=False, 2025-05-07T20:31:36.5501942Z ) 2025-05-07T20:31:36.6674677Z self = 2025-05-07T20:31:36.6675281Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:36.6675661Z 2025-05-07T20:31:36.6675779Z @given( 2025-05-07T20:31:36.6676159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.6676588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.6677003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.6677364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.6677683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.6677963Z ) 2025-05-07T20:31:36.6678307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.6678744Z def test_silu_mul_quant( 2025-05-07T20:31:36.6678982Z self, 2025-05-07T20:31:36.6679179Z T: int, 2025-05-07T20:31:36.6679368Z D: int, 2025-05-07T20:31:36.6679783Z scale_ub: Optional[float], 2025-05-07T20:31:36.6680054Z contiguous: bool, 2025-05-07T20:31:36.6680284Z compiled: bool, 2025-05-07T20:31:36.6680512Z ) -> None: 2025-05-07T20:31:36.6680725Z torch.manual_seed(2025) 2025-05-07T20:31:36.6680963Z 2025-05-07T20:31:36.6681228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.6681564Z 2025-05-07T20:31:36.6681753Z x_sign = torch.sign(x) 2025-05-07T20:31:36.6682036Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.6682345Z x = x_sign * x_clamp 2025-05-07T20:31:36.6682583Z x0 = x[:, :D] 2025-05-07T20:31:36.6682794Z x1 = x[:, D:] 2025-05-07T20:31:36.6683004Z 2025-05-07T20:31:36.6683192Z if contiguous: 2025-05-07T20:31:36.6683519Z x0 = x0.contiguous() 2025-05-07T20:31:36.6683783Z x1 = x1.contiguous() 2025-05-07T20:31:36.6684023Z 2025-05-07T20:31:36.6684214Z if scale_ub is not None: 2025-05-07T20:31:36.6684493Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.6684828Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.6685130Z ) 2025-05-07T20:31:36.6685337Z else: 2025-05-07T20:31:36.6685561Z scale_ub_tensor = None 2025-05-07T20:31:36.6685839Z 2025-05-07T20:31:36.6686080Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.6686481Z op = silu_mul_quant 2025-05-07T20:31:36.6686764Z if compiled: 2025-05-07T20:31:36.6687027Z 
op = torch.compile(op) 2025-05-07T20:31:36.6687359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6687665Z 2025-05-07T20:31:36.6687860Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.6688048Z 2025-05-07T20:31:36.6688151Z moe/activation_test.py:117: 2025-05-07T20:31:36.6688482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6688982Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.6689266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6689958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.6690650Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.6691183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.6691868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.6692537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.6693066Z kernel = self.compile( 2025-05-07T20:31:36.6693610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.6694275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.6694676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6694904Z 2025-05-07T20:31:36.6695112Z self = 2025-05-07T20:31:36.6696189Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.6697555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decff60>} 2025-05-07T20:31:36.6698885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.6699907Z context = 2025-05-07T20:31:36.6700274Z 2025-05-07T20:31:36.6700440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.6700962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.6701428Z module_map=module_map) 2025-05-07T20:31:36.6701789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.6702142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.6702404Z E ^ 2025-05-07T20:31:36.6702872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.6703322Z 2025-05-07T20:31:36.6703741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.6704262Z 2025-05-07T20:31:36.6704364Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.6704783Z self=, 2025-05-07T20:31:36.6705192Z T=128, 2025-05-07T20:31:36.6705374Z D=5120, 2025-05-07T20:31:36.6705567Z scale_ub=None, 2025-05-07T20:31:36.6705788Z contiguous=False, 2025-05-07T20:31:36.6706005Z compiled=True, 2025-05-07T20:31:36.6706213Z ) 2025-05-07T20:31:36.6706535Z self = 2025-05-07T20:31:36.6707019Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:36.6707294Z 2025-05-07T20:31:36.6707371Z @given( 2025-05-07T20:31:36.6707602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.6707907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.6708214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.6708543Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.6708874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.6709241Z ) 2025-05-07T20:31:36.6709595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.6710036Z def test_silu_mul_quant( 2025-05-07T20:31:36.6710272Z self, 2025-05-07T20:31:36.6710457Z T: int, 2025-05-07T20:31:36.6710647Z D: int, 2025-05-07T20:31:36.6710855Z scale_ub: Optional[float], 2025-05-07T20:31:36.6711125Z contiguous: bool, 2025-05-07T20:31:36.6711362Z compiled: bool, 2025-05-07T20:31:36.6711581Z ) -> None: 2025-05-07T20:31:36.6711802Z torch.manual_seed(2025) 2025-05-07T20:31:36.6712043Z 2025-05-07T20:31:36.6712306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.6712650Z 2025-05-07T20:31:36.6712843Z x_sign = torch.sign(x) 2025-05-07T20:31:36.6713126Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.6713435Z x = x_sign * x_clamp 2025-05-07T20:31:36.6713678Z x0 = x[:, :D] 2025-05-07T20:31:36.6713894Z x1 = x[:, D:] 2025-05-07T20:31:36.6714108Z 2025-05-07T20:31:36.6714291Z if contiguous: 2025-05-07T20:31:36.6714523Z x0 = x0.contiguous() 2025-05-07T20:31:36.6714771Z x1 = x1.contiguous() 2025-05-07T20:31:36.6715010Z 2025-05-07T20:31:36.6715202Z if scale_ub is not None: 2025-05-07T20:31:36.6715467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.6715798Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.6716108Z ) 2025-05-07T20:31:36.6716311Z else: 2025-05-07T20:31:36.6716558Z scale_ub_tensor = None 2025-05-07T20:31:36.6716809Z 2025-05-07T20:31:36.6717035Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.6717348Z op = silu_mul_quant 2025-05-07T20:31:36.6717597Z if compiled: 2025-05-07T20:31:36.6717838Z op = torch.compile(op) 2025-05-07T20:31:36.6718141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6718531Z 2025-05-07T20:31:36.6718719Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.6718890Z 2025-05-07T20:31:36.6718989Z moe/activation_test.py:117: 2025-05-07T20:31:36.6719287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6719619Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.6719895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6720455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:36.6721019Z return fn(*args, **kwargs) 
2025-05-07T20:31:36.6721683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.6722376Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.6722927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.6723745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.6724407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.6724946Z kernel = self.compile( 2025-05-07T20:31:36.6725494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.6726155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.6726545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6726781Z 2025-05-07T20:31:36.6726988Z self = 2025-05-07T20:31:36.6728149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.6729529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decea20>} 2025-05-07T20:31:36.6730871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.6731898Z context = 2025-05-07T20:31:36.6732191Z 2025-05-07T20:31:36.6732357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.6732881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.6733341Z module_map=module_map) 2025-05-07T20:31:36.6733703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.6734064Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.6734318Z E ^ 2025-05-07T20:31:36.6734792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.6735249Z 2025-05-07T20:31:36.6735667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.6736180Z 2025-05-07T20:31:36.6736291Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.6736696Z self=, 2025-05-07T20:31:36.6737100Z T=128, 2025-05-07T20:31:36.6737288Z D=7168, 2025-05-07T20:31:36.6737475Z scale_ub=1200.0, 2025-05-07T20:31:36.6737697Z contiguous=False, 2025-05-07T20:31:36.6737921Z compiled=False, 2025-05-07T20:31:36.6738117Z ) 2025-05-07T20:31:36.7616236Z self = 2025-05-07T20:31:36.7616816Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:36.7617401Z 2025-05-07T20:31:36.7617532Z @given( 2025-05-07T20:31:36.7617860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.7618318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.7618743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.7619128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.7619465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.7619758Z ) 2025-05-07T20:31:36.7620124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.7620565Z def test_silu_mul_quant( 2025-05-07T20:31:36.7620813Z self, 2025-05-07T20:31:36.7621018Z T: int, 2025-05-07T20:31:36.7621221Z D: int, 2025-05-07T20:31:36.7621447Z scale_ub: Optional[float], 2025-05-07T20:31:36.7621729Z contiguous: bool, 2025-05-07T20:31:36.7621977Z compiled: bool, 2025-05-07T20:31:36.7622218Z ) -> None: 2025-05-07T20:31:36.7622441Z torch.manual_seed(2025) 2025-05-07T20:31:36.7622692Z 2025-05-07T20:31:36.7622970Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.7623309Z 2025-05-07T20:31:36.7623515Z x_sign = torch.sign(x) 2025-05-07T20:31:36.7623812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.7624130Z x = x_sign * x_clamp 2025-05-07T20:31:36.7624367Z x0 = x[:, :D] 2025-05-07T20:31:36.7624614Z x1 = x[:, D:] 2025-05-07T20:31:36.7624830Z 2025-05-07T20:31:36.7625023Z if contiguous: 2025-05-07T20:31:36.7625256Z x0 = x0.contiguous() 2025-05-07T20:31:36.7625520Z x1 = x1.contiguous() 2025-05-07T20:31:36.7625767Z 2025-05-07T20:31:36.7625963Z if scale_ub is not None: 2025-05-07T20:31:36.7626233Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.7626738Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.7627064Z ) 2025-05-07T20:31:36.7627260Z else: 2025-05-07T20:31:36.7627480Z scale_ub_tensor = None 2025-05-07T20:31:36.7627738Z 2025-05-07T20:31:36.7627971Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.7628290Z op = silu_mul_quant 2025-05-07T20:31:36.7628546Z if compiled: 2025-05-07T20:31:36.7628791Z op = torch.compile(op) 2025-05-07T20:31:36.7629095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.7629372Z 2025-05-07T20:31:36.7629568Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.7629739Z 2025-05-07T20:31:36.7629839Z moe/activation_test.py:117: 2025-05-07T20:31:36.7630140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.7630479Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.7630762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.7631469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.7632173Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f090d4525c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:36.7646293Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[same make_ir context as above]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:36.7687266Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[same test body and traceback as above; same CompilationError while compiling _fbgemm_silu_mul_quant]
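Every example so far fails with the same compile-time error: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) has no native support on this runner's GPU. The job runs on linux.g5.4xlarge (NVIDIA A10G, compute capability 8.6), and Triton's fp8e4nv codegen targets compute capability 8.9 and newer (Ada/Hopper); on sm_86 only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability guard for tests like this is shown below; the helper name supports_fp8e4nv is illustrative, not part of the test file, and it assumes the kernels under test have no non-fp8e4nv fallback.

    # Sketch: skip FP8 e4m3 tests on GPUs older than sm_89.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn; NVIDIA hardware supports
        # it natively from compute capability (8, 9) onward. The A10G on
        # linux.g5.4xlarge reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...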
2025-05-07T20:31:37.1135667Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[same test body; with compiled=True the traceback additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80; same CompilationError]

2025-05-07T20:31:37.1167522Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same test body and traceback; same CompilationError]

2025-05-07T20:31:37.2223985Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

[same test body through fn(); here the failure surfaces in the reference path instead]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f090dc9eb60>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:37.2956149Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same test body; back to the original failure mode: same CompilationError while compiling _fbgemm_silu_mul_quant]
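The reference path fails the same way because triton_quantize_fp8_row also emits an fp8e4nv cast, this time from inside the autotuner's benchmarking loop. For orientation, below is a rough eager-PyTorch sketch of rowwise FP8 quantization consistent with the test's dequantization step (y ≈ y_fp8.to(torch.float32) * y_scale[:, None]). This is inferred semantics for illustration only, not FBGEMM's implementation; in particular, treating scale_ub as a cap on the per-row maximum is an assumption.

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dequant scale chosen so the row's max maps to the fp8 max.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:  # assumed: scale_ub caps the row max
            row_max = torch.minimum(row_max, scale_ub)
        row_max = row_max.clamp(min=1e-12)  # avoid divide-by-zero on zero rows
        scale = row_max / FP8_MAX  # dequant: y ≈ q.to(torch.float32) * scale
        q = torch.clamp(y / scale[:, None], -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return q, scale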
2025-05-07T20:31:37.4182503Z op = torch.compile(op) 2025-05-07T20:31:37.4182799Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4183070Z 2025-05-07T20:31:37.4183264Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.4183424Z 2025-05-07T20:31:37.4183528Z moe/activation_test.py:117: 2025-05-07T20:31:37.4183823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4184151Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.4184429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4184984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:37.4185550Z return fn(*args, **kwargs) 2025-05-07T20:31:37.4186213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.4186902Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.4187436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.4188121Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.4188793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.4189324Z kernel = self.compile( 2025-05-07T20:31:37.4190000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.4190659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.4191051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4191277Z 2025-05-07T20:31:37.4191484Z self = 2025-05-07T20:31:37.4192554Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.4193909Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090dc9f880>} 2025-05-07T20:31:37.4195244Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.4196267Z context = 2025-05-07T20:31:37.4196552Z 2025-05-07T20:31:37.4196719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.4197248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.4197719Z module_map=module_map) 2025-05-07T20:31:37.4198076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.4198428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.4198685Z E ^ 2025-05-07T20:31:37.4199148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.4199603Z 2025-05-07T20:31:37.4200112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.4200634Z 2025-05-07T20:31:37.4200739Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.4201147Z self=, 2025-05-07T20:31:37.4201544Z T=1, 2025-05-07T20:31:37.4201723Z D=5120, 2025-05-07T20:31:37.4201911Z scale_ub=1200.0, 2025-05-07T20:31:37.4202129Z contiguous=False, 2025-05-07T20:31:37.4202347Z compiled=False, 2025-05-07T20:31:37.4202543Z ) 2025-05-07T20:31:37.4202857Z self = 2025-05-07T20:31:37.4203342Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:37.4203735Z 2025-05-07T20:31:37.4203812Z @given( 2025-05-07T20:31:37.4204042Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.4204355Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.4204654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.4204975Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.4205296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.4205572Z ) 2025-05-07T20:31:37.4205916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.4206403Z def test_silu_mul_quant( 2025-05-07T20:31:37.4206641Z self, 2025-05-07T20:31:37.4206826Z T: int, 2025-05-07T20:31:37.4207019Z D: int, 2025-05-07T20:31:37.4207235Z scale_ub: Optional[float], 2025-05-07T20:31:37.4207501Z contiguous: bool, 2025-05-07T20:31:37.4207730Z compiled: bool, 2025-05-07T20:31:37.4207949Z ) -> None: 2025-05-07T20:31:37.4208153Z torch.manual_seed(2025) 2025-05-07T20:31:37.4208388Z 2025-05-07T20:31:37.4208654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.4208993Z 2025-05-07T20:31:37.4209273Z x_sign = torch.sign(x) 2025-05-07T20:31:37.4209559Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.4209855Z x = x_sign * x_clamp 2025-05-07T20:31:37.4210084Z x0 = x[:, :D] 2025-05-07T20:31:37.4210290Z x1 = x[:, D:] 2025-05-07T20:31:37.4210491Z 2025-05-07T20:31:37.4210674Z if contiguous: 2025-05-07T20:31:37.4210903Z x0 = x0.contiguous() 2025-05-07T20:31:37.4211149Z x1 = x1.contiguous() 2025-05-07T20:31:37.4211381Z 2025-05-07T20:31:37.4211572Z if scale_ub is not None: 2025-05-07T20:31:37.4211834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.4212161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.4212467Z ) 2025-05-07T20:31:37.4212654Z else: 2025-05-07T20:31:37.4212854Z scale_ub_tensor = None 2025-05-07T20:31:37.4213096Z 2025-05-07T20:31:37.4213329Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.4213651Z op = silu_mul_quant 2025-05-07T20:31:37.4213894Z if compiled: 2025-05-07T20:31:37.4214132Z op = torch.compile(op) 2025-05-07T20:31:37.4214423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4214687Z 2025-05-07T20:31:37.4214877Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.4215036Z 2025-05-07T20:31:37.4215136Z moe/activation_test.py:117: 2025-05-07T20:31:37.4215414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4215738Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.4216011Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4216693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.4217380Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.4217999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.4218687Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.4219344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.4219875Z kernel = self.compile( 2025-05-07T20:31:37.4220414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.4221072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.4221460Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4221686Z 2025-05-07T20:31:37.4221891Z self = 2025-05-07T20:31:37.4222965Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.4224325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce76480>} 2025-05-07T20:31:37.4225658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.4226678Z context = 2025-05-07T20:31:37.4226968Z 2025-05-07T20:31:37.4227132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.4227647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.4228104Z module_map=module_map) 2025-05-07T20:31:37.4228465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.4228961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.4229211Z E ^ 2025-05-07T20:31:37.4229672Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.4230124Z 2025-05-07T20:31:37.4230544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.4231060Z 2025-05-07T20:31:37.4231166Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.4231567Z self=, 2025-05-07T20:31:37.4231968Z T=16384, 2025-05-07T20:31:37.4232152Z D=5120, 2025-05-07T20:31:37.4232336Z scale_ub=1200.0, 2025-05-07T20:31:37.4232555Z contiguous=False, 2025-05-07T20:31:37.4232773Z compiled=True, 2025-05-07T20:31:37.4232967Z ) 2025-05-07T20:31:37.7119091Z self = 2025-05-07T20:31:37.7120415Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:37.7120976Z 2025-05-07T20:31:37.7121138Z @given( 2025-05-07T20:31:37.7121580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.7122197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.7122795Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.7123597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.7124264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.7124811Z ) 2025-05-07T20:31:37.7125502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.7126348Z def test_silu_mul_quant( 2025-05-07T20:31:37.7126581Z self, 2025-05-07T20:31:37.7126777Z T: int, 2025-05-07T20:31:37.7126974Z D: int, 2025-05-07T20:31:37.7127184Z scale_ub: Optional[float], 2025-05-07T20:31:37.7127645Z contiguous: bool, 2025-05-07T20:31:37.7127895Z compiled: bool, 2025-05-07T20:31:37.7128116Z ) -> None: 2025-05-07T20:31:37.7128333Z torch.manual_seed(2025) 2025-05-07T20:31:37.7128571Z 2025-05-07T20:31:37.7128836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.7129179Z 2025-05-07T20:31:37.7129378Z x_sign = torch.sign(x) 2025-05-07T20:31:37.7129664Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.7129966Z x = x_sign * x_clamp 2025-05-07T20:31:37.7130202Z x0 = x[:, :D] 2025-05-07T20:31:37.7130417Z x1 = x[:, D:] 2025-05-07T20:31:37.7130615Z 2025-05-07T20:31:37.7130800Z if contiguous: 2025-05-07T20:31:37.7131031Z x0 = x0.contiguous() 2025-05-07T20:31:37.7131283Z x1 = x1.contiguous() 2025-05-07T20:31:37.7131521Z 2025-05-07T20:31:37.7131710Z if scale_ub is not None: 2025-05-07T20:31:37.7131979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.7132315Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.7132627Z ) 2025-05-07T20:31:37.7132811Z else: 2025-05-07T20:31:37.7133021Z scale_ub_tensor = None 2025-05-07T20:31:37.7133270Z 2025-05-07T20:31:37.7133494Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7133802Z op = silu_mul_quant 2025-05-07T20:31:37.7134049Z if compiled: 2025-05-07T20:31:37.7134289Z op = torch.compile(op) 2025-05-07T20:31:37.7134584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7134853Z 2025-05-07T20:31:37.7135045Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.7135207Z 2025-05-07T20:31:37.7135306Z moe/activation_test.py:117: 2025-05-07T20:31:37.7135592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7135919Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.7136196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7136913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:37.7137470Z return fn(*args, **kwargs) 
2025-05-07T20:31:37.7138129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.7138998Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.7139538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7140222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7140883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7141413Z kernel = self.compile( 2025-05-07T20:31:37.7141965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7142628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7143016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7143247Z 2025-05-07T20:31:37.7143453Z self = 2025-05-07T20:31:37.7144523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7145882Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce751c0>} 2025-05-07T20:31:37.7147389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7148425Z context = 2025-05-07T20:31:37.7148718Z 2025-05-07T20:31:37.7148883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7149405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7149868Z module_map=module_map) 2025-05-07T20:31:37.7150234Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7150586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.7150842Z E ^ 2025-05-07T20:31:37.7151305Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7151762Z 2025-05-07T20:31:37.7152184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7152702Z 2025-05-07T20:31:37.7152808Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.7153213Z self=, 2025-05-07T20:31:37.7153613Z T=2048, 2025-05-07T20:31:37.7153798Z D=7168, 2025-05-07T20:31:37.7153987Z scale_ub=1200.0, 2025-05-07T20:31:37.7154205Z contiguous=False, 2025-05-07T20:31:37.7154435Z compiled=True, 2025-05-07T20:31:37.7154635Z ) 2025-05-07T20:31:37.7154948Z self = 2025-05-07T20:31:37.7155449Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:37.7155721Z 2025-05-07T20:31:37.7155802Z @given( 2025-05-07T20:31:37.7156024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.7156336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.7156638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.7156962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.7157416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.7157702Z ) 2025-05-07T20:31:37.7158050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.7158484Z def test_silu_mul_quant( 2025-05-07T20:31:37.7158722Z self, 2025-05-07T20:31:37.7158915Z T: int, 2025-05-07T20:31:37.7159104Z D: int, 2025-05-07T20:31:37.7159316Z scale_ub: Optional[float], 2025-05-07T20:31:37.7159588Z contiguous: bool, 2025-05-07T20:31:37.7159820Z compiled: bool, 2025-05-07T20:31:37.7160041Z ) -> None: 2025-05-07T20:31:37.7160258Z torch.manual_seed(2025) 2025-05-07T20:31:37.7160487Z 2025-05-07T20:31:37.7160759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.7161099Z 2025-05-07T20:31:37.7161291Z x_sign = torch.sign(x) 2025-05-07T20:31:37.7161587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.7161902Z x = x_sign * x_clamp 2025-05-07T20:31:37.7162133Z x0 = x[:, :D] 2025-05-07T20:31:37.7162347Z x1 = x[:, D:] 2025-05-07T20:31:37.7162556Z 2025-05-07T20:31:37.7162741Z if contiguous: 2025-05-07T20:31:37.7162963Z x0 = x0.contiguous() 2025-05-07T20:31:37.7163217Z x1 = x1.contiguous() 2025-05-07T20:31:37.7163535Z 2025-05-07T20:31:37.7163716Z if scale_ub is not None: 2025-05-07T20:31:37.7163988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.7164319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.7164620Z ) 2025-05-07T20:31:37.7164807Z else: 2025-05-07T20:31:37.7165020Z scale_ub_tensor = None 2025-05-07T20:31:37.7165263Z 2025-05-07T20:31:37.7165495Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7165803Z op = silu_mul_quant 2025-05-07T20:31:37.7166127Z if compiled: 2025-05-07T20:31:37.7166378Z op = torch.compile(op) 2025-05-07T20:31:37.7166671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7166938Z 2025-05-07T20:31:37.7167131Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.7167298Z 2025-05-07T20:31:37.7167395Z moe/activation_test.py:117: 2025-05-07T20:31:37.7167690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7168012Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.7168291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7168848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:37.7169402Z return fn(*args, **kwargs) 
2025-05-07T20:31:37.7170060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.7170753Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.7171300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7171977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7172641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7173174Z kernel = self.compile( 2025-05-07T20:31:37.7173709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7174364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7174757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7174983Z 2025-05-07T20:31:37.7175195Z self = 2025-05-07T20:31:37.7176266Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7177712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce76fc0>} 2025-05-07T20:31:37.7179057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7180083Z context = 2025-05-07T20:31:37.7180369Z 2025-05-07T20:31:37.7180542Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7181058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7181530Z module_map=module_map) 2025-05-07T20:31:37.7181904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7182253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.7182509Z E ^ 2025-05-07T20:31:37.7182971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7183419Z 2025-05-07T20:31:37.7183841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7184355Z 2025-05-07T20:31:37.8078255Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.8078680Z self=, 2025-05-07T20:31:37.8079083Z T=1, 2025-05-07T20:31:37.8079264Z D=5120, 2025-05-07T20:31:37.8079490Z scale_ub=None, 2025-05-07T20:31:37.8079833Z contiguous=False, 2025-05-07T20:31:37.8080089Z compiled=False, 2025-05-07T20:31:37.8080297Z ) 2025-05-07T20:31:37.8080907Z self = 2025-05-07T20:31:37.8081408Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:37.8081670Z 2025-05-07T20:31:37.8081754Z @given( 2025-05-07T20:31:37.8081972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.8082281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.8082580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.8082903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.8083225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.8083626Z ) 2025-05-07T20:31:37.8083967Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.8084398Z def test_silu_mul_quant( 2025-05-07T20:31:37.8084633Z self, 2025-05-07T20:31:37.8084822Z T: int, 2025-05-07T20:31:37.8085012Z D: int, 2025-05-07T20:31:37.8085228Z scale_ub: Optional[float], 2025-05-07T20:31:37.8085496Z contiguous: bool, 2025-05-07T20:31:37.8085724Z compiled: bool, 2025-05-07T20:31:37.8085944Z ) -> None: 2025-05-07T20:31:37.8086157Z torch.manual_seed(2025) 2025-05-07T20:31:37.8086388Z 2025-05-07T20:31:37.8086653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.8086987Z 2025-05-07T20:31:37.8087168Z x_sign = torch.sign(x) 2025-05-07T20:31:37.8087454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.8087758Z x = x_sign * x_clamp 2025-05-07T20:31:37.8087986Z x0 = x[:, :D] 2025-05-07T20:31:37.8088197Z x1 = x[:, D:] 2025-05-07T20:31:37.8088399Z 2025-05-07T20:31:37.8088584Z if contiguous: 2025-05-07T20:31:37.8088806Z x0 = x0.contiguous() 2025-05-07T20:31:37.8089057Z x1 = x1.contiguous() 2025-05-07T20:31:37.8089299Z 2025-05-07T20:31:37.8089487Z if scale_ub is not None: 2025-05-07T20:31:37.8089892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.8090222Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.8097446Z ) 2025-05-07T20:31:37.8097680Z else: 2025-05-07T20:31:37.8097900Z scale_ub_tensor = None 2025-05-07T20:31:37.8098152Z 2025-05-07T20:31:37.8098383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.8098705Z op = silu_mul_quant 2025-05-07T20:31:37.8098955Z if compiled: 2025-05-07T20:31:37.8099202Z op = torch.compile(op) 2025-05-07T20:31:37.8099497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8099760Z 2025-05-07T20:31:37.8099956Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.8100117Z 2025-05-07T20:31:37.8100224Z moe/activation_test.py:117: 2025-05-07T20:31:37.8100512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.8100853Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.8101136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8101821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.8102499Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.8103034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.8103709Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.8104371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.8104899Z kernel = self.compile( 2025-05-07T20:31:37.8105440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.8106090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.8106593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.8106829Z 2025-05-07T20:31:37.8107037Z self = 2025-05-07T20:31:37.8108113Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.8109467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090c5b0860>} 2025-05-07T20:31:37.8110795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.8111812Z context = 2025-05-07T20:31:37.8112103Z 2025-05-07T20:31:37.8112266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.8112783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.8113242Z module_map=module_map) 2025-05-07T20:31:37.8113595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.8113943Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.8114203Z E ^ 2025-05-07T20:31:37.8114659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.8115112Z 2025-05-07T20:31:37.8115528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.8116062Z 2025-05-07T20:31:37.8116161Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.8116580Z self=, 2025-05-07T20:31:37.8117101Z T=4096, 2025-05-07T20:31:37.8117280Z D=7168, 2025-05-07T20:31:37.8117465Z scale_ub=1200.0, 2025-05-07T20:31:37.8117685Z contiguous=False, 2025-05-07T20:31:37.8117902Z compiled=False, 2025-05-07T20:31:37.8118102Z ) 2025-05-07T20:31:37.8118422Z self = 2025-05-07T20:31:37.8118909Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:37.8119190Z 2025-05-07T20:31:37.8119262Z @given( 2025-05-07T20:31:37.8119487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.8119791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.8120088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.8120409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.8120734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.8121025Z ) 2025-05-07T20:31:37.8121369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.8121806Z def test_silu_mul_quant( 2025-05-07T20:31:37.8122031Z self, 2025-05-07T20:31:37.8122223Z T: int, 2025-05-07T20:31:37.8122414Z D: int, 2025-05-07T20:31:37.8122618Z scale_ub: Optional[float], 2025-05-07T20:31:37.8122886Z contiguous: bool, 2025-05-07T20:31:37.8123121Z compiled: bool, 2025-05-07T20:31:37.8123331Z ) -> None: 2025-05-07T20:31:37.8123648Z torch.manual_seed(2025) 2025-05-07T20:31:37.8123887Z 2025-05-07T20:31:37.8124157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.8124494Z 2025-05-07T20:31:37.8124685Z x_sign = torch.sign(x) 2025-05-07T20:31:37.8124970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.8125276Z x = x_sign * x_clamp 2025-05-07T20:31:37.8125593Z x0 = x[:, :D] 2025-05-07T20:31:37.8125808Z x1 = x[:, D:] 2025-05-07T20:31:37.8126013Z 2025-05-07T20:31:37.8126209Z if contiguous: 2025-05-07T20:31:37.8126463Z x0 = x0.contiguous() 2025-05-07T20:31:37.8126712Z x1 = x1.contiguous() 2025-05-07T20:31:37.8126948Z 2025-05-07T20:31:37.8127130Z if scale_ub is not None: 2025-05-07T20:31:37.8127392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.8127716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.8128011Z ) 2025-05-07T20:31:37.8128194Z else: 2025-05-07T20:31:37.8128394Z scale_ub_tensor = None 2025-05-07T20:31:37.8128631Z 2025-05-07T20:31:37.8128859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.8129165Z op = silu_mul_quant 2025-05-07T20:31:37.8129404Z if compiled: 2025-05-07T20:31:37.8129645Z op = torch.compile(op) 2025-05-07T20:31:37.8129939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8130208Z 2025-05-07T20:31:37.8130386Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.8130554Z 2025-05-07T20:31:37.8130649Z moe/activation_test.py:117: 2025-05-07T20:31:37.8130937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.8131259Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.8131534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8132216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:37.8132899Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:37.8144243Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:37.8144590Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:37.8144844Z E       ^
2025-05-07T20:31:37.8145448Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.8146328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:37.8146952Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:37.9553270Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.9554149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
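Note on the failure mode: every example dies inside make_ir(), i.e. while Triton is still compiling `_fbgemm_silu_mul_quant`, because the kernel uses fp8e4nv, Triton's name for the NVIDIA float8_e4m3fn format. Triton only generates fp8e4nv code on GPUs of compute capability (8, 9) or newer (Ada/Hopper); on anything older the backend advertises just 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError reports, so the GPU driving this job is evidently pre-SM 8.9. Below is a minimal sketch of a capability gate that would turn such failures into skips; the decorator name `requires_fp8e4nv` is hypothetical, while `torch.cuda.get_device_capability` and `unittest.skipIf` are standard APIs.

    # Minimal sketch (not FBGEMM's actual guard): skip FP8 tests on GPUs whose
    # compute capability is below (8, 9), where Triton has no fp8e4nv support.
    import unittest

    import torch


    def _has_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; NVIDIA GPUs support it
        # natively from compute capability (8, 9) (Ada) onward.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    requires_fp8e4nv = unittest.skipIf(
        not _has_fp8e4nv(),
        "Triton fp8e4nv needs SM 8.9+; this GPU offers only fp8e4b15/fp8e5",
    )

Applied as `@requires_fp8e4nv` on `test_silu_mul_quant`, the examples in this run would be reported as skips instead of errors on this hardware.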
2025-05-07T20:31:37.9554774Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:37.9585087Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.9585966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
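For orientation while reading the remaining examples: from the test body shown above, `silu_mul_quant` takes two bf16 halves of an activation, applies a SiLU-gated multiply, and returns the result quantized to FP8 together with a scale, optionally clamped by `scale_ub`. The eager sketch below is only a plausible reading of that contract; the per-row scaling and the name `silu_mul_quant_ref` are assumptions, not FBGEMM's documented semantics.

    # Hedged eager-mode reference for what a fused silu_mul_quant plausibly
    # computes; the scaling granularity here (per row) is an assumption.
    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)  # scale_ub: 1-element fp32 tensor
        scale = amax.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale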
2025-05-07T20:31:38.0743335Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:38.0773532Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.0774404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.0775029Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:38.0812684Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.0813647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.1694062Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:38.1724003Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.1724878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.1725497Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:38.1754839Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.1755716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.1756333Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:38.5845291Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.5846170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.5846793Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:38.5882536Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.5883487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.5884108Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:38.6599185Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.6600141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.6600761Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:38.6630801Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.6631671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.6617907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:38.6618600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:38.6619129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.6619814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.6620485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.6621021Z kernel = self.compile( 2025-05-07T20:31:38.6621562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.6622218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.6622612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.6622849Z 2025-05-07T20:31:38.6623063Z self = 2025-05-07T20:31:38.6624128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.6625494Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa287c0>} 2025-05-07T20:31:38.6626841Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.6627861Z context = 2025-05-07T20:31:38.6628148Z 2025-05-07T20:31:38.6628391Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.6628918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.6629389Z module_map=module_map) 2025-05-07T20:31:38.6629745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.6630086Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.6630341Z E ^ 2025-05-07T20:31:38.6630801Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.6631671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
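[Annotation: every failing example above and below hits the same root cause. The _fbgemm_silu_mul_quant Triton kernel requests the fp8e4nv dtype (torch.float8_e4m3fn), which this Triton build can only lower on NVIDIA GPUs of compute capability (8, 9) or newer; on older parts only fp8e4b15 and fp8e5 are available, so compilation aborts before the kernel ever runs. The g5.4xlarge runner's A10G reports capability (8, 6). Below is a minimal sketch of a capability guard that would skip, rather than fail, these cases on such runners; the helper name and TestCase name are illustrative, not taken from the log:]

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton lowers it only on
        # compute capability (8, 9) or newer (Ada/Hopper). An A10G (8, 6)
        # returns False here, matching the CompilationError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; the real class name is hidden by the stripped reprs in this log.
    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...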
2025-05-07T20:31:38.7923152Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:38.7923799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.7924480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.7925145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.7925674Z kernel = self.compile( 2025-05-07T20:31:38.7926229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.7926934Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.7927319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.7927553Z 2025-05-07T20:31:38.7927763Z self = 2025-05-07T20:31:38.7928834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.7930197Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa29620>} 2025-05-07T20:31:38.7931529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.7932627Z context = 2025-05-07T20:31:38.7932916Z 2025-05-07T20:31:38.7933081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.7933601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.7934064Z module_map=module_map) 2025-05-07T20:31:38.7934427Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.7934782Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.7935041Z E ^ 2025-05-07T20:31:38.7935501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.7935960Z 2025-05-07T20:31:38.7936384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:38.7936911Z 2025-05-07T20:31:38.7937014Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:38.7937424Z self=, 2025-05-07T20:31:38.7937821Z T=16384, 2025-05-07T20:31:38.7938012Z D=5120, 2025-05-07T20:31:38.7938198Z scale_ub=1200.0, 2025-05-07T20:31:38.7938627Z contiguous=True, 2025-05-07T20:31:38.7938852Z compiled=True, 2025-05-07T20:31:38.7939053Z ) 2025-05-07T20:31:38.7939364Z self = 2025-05-07T20:31:38.7939858Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:38.7940129Z 2025-05-07T20:31:38.7940215Z @given( 2025-05-07T20:31:38.7940437Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:38.7940747Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:38.7941569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:38.7941904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:38.7942226Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:38.7942509Z ) 2025-05-07T20:31:38.7942859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:38.7943291Z def test_silu_mul_quant( 2025-05-07T20:31:38.7943552Z self, 2025-05-07T20:31:38.7943745Z T: int, 2025-05-07T20:31:38.7943938Z D: int, 2025-05-07T20:31:38.7944151Z scale_ub: Optional[float], 2025-05-07T20:31:38.7944418Z contiguous: bool, 2025-05-07T20:31:38.7944652Z compiled: bool, 2025-05-07T20:31:38.7944869Z ) -> None: 2025-05-07T20:31:38.7945079Z torch.manual_seed(2025) 2025-05-07T20:31:38.7945320Z 2025-05-07T20:31:38.7945582Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:38.7945922Z 2025-05-07T20:31:38.7946124Z x_sign = torch.sign(x) 2025-05-07T20:31:38.7946413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:38.7946714Z x = x_sign * x_clamp 2025-05-07T20:31:38.7946949Z x0 = x[:, :D] 2025-05-07T20:31:38.7947159Z x1 = x[:, D:] 2025-05-07T20:31:38.7947361Z 2025-05-07T20:31:38.7947542Z if contiguous: 2025-05-07T20:31:38.7947772Z x0 = x0.contiguous() 2025-05-07T20:31:38.7948022Z x1 = x1.contiguous() 2025-05-07T20:31:38.7948260Z 2025-05-07T20:31:38.7948452Z if scale_ub is not None: 2025-05-07T20:31:38.7948716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:38.7949045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:38.7949348Z ) 2025-05-07T20:31:38.7949534Z else: 2025-05-07T20:31:38.7949741Z scale_ub_tensor = None 2025-05-07T20:31:38.7949992Z 2025-05-07T20:31:38.7950229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:38.7950675Z op = silu_mul_quant 2025-05-07T20:31:38.7950922Z if compiled: 2025-05-07T20:31:38.7951157Z op = torch.compile(op) 2025-05-07T20:31:38.7951455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:38.7951727Z 2025-05-07T20:31:38.7951915Z > y_fp8, y_scale = fn() 2025-05-07T20:31:38.7952076Z 2025-05-07T20:31:38.7952174Z moe/activation_test.py:117: 2025-05-07T20:31:38.7952462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.7952791Z moe/activation_test.py:115: in fn 2025-05-07T20:31:38.7953068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:38.7953625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:38.7954184Z return fn(*args, **kwargs) 
2025-05-07T20:31:38.7954849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:38.7955544Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:38.7956081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.7956816Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.7957477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.7958009Z kernel = self.compile( 2025-05-07T20:31:38.7958552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.7959216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.7959608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.7959837Z 2025-05-07T20:31:38.7960154Z self = 2025-05-07T20:31:38.7961231Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.7962593Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2aa20>} 2025-05-07T20:31:38.7964038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.7965066Z context = 2025-05-07T20:31:38.7965360Z 2025-05-07T20:31:38.7965525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.7966059Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.7966525Z module_map=module_map) 2025-05-07T20:31:38.7966885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.7967242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.7967499Z E ^ 2025-05-07T20:31:38.7967956Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.7968408Z 2025-05-07T20:31:38.7968827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:38.7969340Z 2025-05-07T20:31:39.1530351Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.1530795Z self=, 2025-05-07T20:31:39.1531406Z T=16384, 2025-05-07T20:31:39.1531683Z D=5120, 2025-05-07T20:31:39.1531893Z scale_ub=None, 2025-05-07T20:31:39.1532115Z contiguous=False, 2025-05-07T20:31:39.1532523Z compiled=True, 2025-05-07T20:31:39.1532737Z ) 2025-05-07T20:31:39.1533062Z self = 2025-05-07T20:31:39.1533559Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:39.1533848Z 2025-05-07T20:31:39.1533931Z @given( 2025-05-07T20:31:39.1534167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.1534481Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.1534788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.1535123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.1535444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.1535735Z ) 2025-05-07T20:31:39.1536089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.1536565Z def test_silu_mul_quant( 2025-05-07T20:31:39.1536825Z self, 2025-05-07T20:31:39.1537041Z T: int, 2025-05-07T20:31:39.1537243Z D: int, 2025-05-07T20:31:39.1537457Z scale_ub: Optional[float], 2025-05-07T20:31:39.1537734Z contiguous: bool, 2025-05-07T20:31:39.1537974Z compiled: bool, 2025-05-07T20:31:39.1538194Z ) -> None: 2025-05-07T20:31:39.1538659Z torch.manual_seed(2025) 2025-05-07T20:31:39.1538904Z 2025-05-07T20:31:39.1539171Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.1539513Z 2025-05-07T20:31:39.1539706Z x_sign = torch.sign(x) 2025-05-07T20:31:39.1539992Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.1540304Z x = x_sign * x_clamp 2025-05-07T20:31:39.1540548Z x0 = x[:, :D] 2025-05-07T20:31:39.1540760Z x1 = x[:, D:] 2025-05-07T20:31:39.1540965Z 2025-05-07T20:31:39.1541155Z if contiguous: 2025-05-07T20:31:39.1541390Z x0 = x0.contiguous() 2025-05-07T20:31:39.1541783Z x1 = x1.contiguous() 2025-05-07T20:31:39.1542036Z 2025-05-07T20:31:39.1542233Z if scale_ub is not None: 2025-05-07T20:31:39.1542502Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.1542839Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.1543151Z ) 2025-05-07T20:31:39.1543344Z else: 2025-05-07T20:31:39.1543558Z scale_ub_tensor = None 2025-05-07T20:31:39.1543813Z 2025-05-07T20:31:39.1544041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.1544362Z op = silu_mul_quant 2025-05-07T20:31:39.1544608Z if compiled: 2025-05-07T20:31:39.1544847Z op = torch.compile(op) 2025-05-07T20:31:39.1545145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.1545426Z 2025-05-07T20:31:39.1545612Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.1545787Z 2025-05-07T20:31:39.1545887Z moe/activation_test.py:117: 2025-05-07T20:31:39.1546186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.1546531Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.1546810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.1547379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.1547947Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.1548606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.1549299Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.1549839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.1550520Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.1551192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.1551852Z kernel = self.compile( 2025-05-07T20:31:39.1552416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.1553082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.1553475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.1553704Z 2025-05-07T20:31:39.1553915Z self = 2025-05-07T20:31:39.1554997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.1556364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2bc40>} 2025-05-07T20:31:39.1557713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.1558736Z context = 2025-05-07T20:31:39.1559025Z 2025-05-07T20:31:39.1559198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.1559725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.1560188Z module_map=module_map) 2025-05-07T20:31:39.1560554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.1560907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.1561155Z E ^ 2025-05-07T20:31:39.1561699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.1562156Z 2025-05-07T20:31:39.1562584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.1563097Z 2025-05-07T20:31:39.1563207Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.1563750Z self=, 2025-05-07T20:31:39.1564147Z T=2048, 2025-05-07T20:31:39.1564338Z D=5120, 2025-05-07T20:31:39.1564522Z scale_ub=None, 2025-05-07T20:31:39.1564741Z contiguous=False, 2025-05-07T20:31:39.1564963Z compiled=True, 2025-05-07T20:31:39.1565163Z ) 2025-05-07T20:31:39.2294062Z self = 2025-05-07T20:31:39.2294585Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:39.2294887Z 2025-05-07T20:31:39.2294999Z @given( 2025-05-07T20:31:39.2295322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.2295653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.2296030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.2296403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.2296724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.2297011Z ) 2025-05-07T20:31:39.2297361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.2297799Z def test_silu_mul_quant( 2025-05-07T20:31:39.2298042Z self, 2025-05-07T20:31:39.2298236Z T: int, 2025-05-07T20:31:39.2298427Z D: int, 2025-05-07T20:31:39.2298644Z scale_ub: Optional[float], 2025-05-07T20:31:39.2298912Z contiguous: bool, 2025-05-07T20:31:39.2299148Z compiled: bool, 2025-05-07T20:31:39.2299374Z ) -> None: 2025-05-07T20:31:39.2299590Z torch.manual_seed(2025) 2025-05-07T20:31:39.2299823Z 2025-05-07T20:31:39.2300102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.2300617Z 2025-05-07T20:31:39.2300805Z x_sign = torch.sign(x) 2025-05-07T20:31:39.2301097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.2301406Z x = x_sign * x_clamp 2025-05-07T20:31:39.2301643Z x0 = x[:, :D] 2025-05-07T20:31:39.2301853Z x1 = x[:, D:] 2025-05-07T20:31:39.2302061Z 2025-05-07T20:31:39.2302245Z if contiguous: 2025-05-07T20:31:39.2302471Z x0 = x0.contiguous() 2025-05-07T20:31:39.2302729Z x1 = x1.contiguous() 2025-05-07T20:31:39.2302967Z 2025-05-07T20:31:39.2303150Z if scale_ub is not None: 2025-05-07T20:31:39.2303431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.2303765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.2304062Z ) 2025-05-07T20:31:39.2304256Z else: 2025-05-07T20:31:39.2304465Z scale_ub_tensor = None 2025-05-07T20:31:39.2304711Z 2025-05-07T20:31:39.2304951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.2305288Z op = silu_mul_quant 2025-05-07T20:31:39.2305537Z if compiled: 2025-05-07T20:31:39.2305779Z op = torch.compile(op) 2025-05-07T20:31:39.2306075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2306346Z 2025-05-07T20:31:39.2306530Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.2306698Z 2025-05-07T20:31:39.2306797Z moe/activation_test.py:117: 2025-05-07T20:31:39.2307093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2307425Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.2307700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2308262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.2308824Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.2309613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.2310317Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.2310858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.2311542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.2312203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.2312742Z kernel = self.compile( 2025-05-07T20:31:39.2313290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.2313955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.2314349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2314583Z 2025-05-07T20:31:39.2314797Z self = 2025-05-07T20:31:39.2315881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.2317294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb87c0>} 2025-05-07T20:31:39.2318636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.2319666Z context = 2025-05-07T20:31:39.2319958Z 2025-05-07T20:31:39.2320126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.2320736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.2321204Z module_map=module_map) 2025-05-07T20:31:39.2321568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.2321922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.2322176Z E ^ 2025-05-07T20:31:39.2322644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.2323104Z 2025-05-07T20:31:39.2323647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.2324166Z 2025-05-07T20:31:39.2324274Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.2324685Z self=, 2025-05-07T20:31:39.2325088Z T=2048, 2025-05-07T20:31:39.2325274Z D=5120, 2025-05-07T20:31:39.2325474Z scale_ub=1200.0, 2025-05-07T20:31:39.2325699Z contiguous=False, 2025-05-07T20:31:39.2325927Z compiled=True, 2025-05-07T20:31:39.2326125Z ) 2025-05-07T20:31:39.2326449Z self = 2025-05-07T20:31:39.2326948Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:39.2327222Z 2025-05-07T20:31:39.2327300Z @given( 2025-05-07T20:31:39.2327524Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.2327840Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.2328149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.2328474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.2328808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.2329095Z ) 2025-05-07T20:31:39.2329442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.2329996Z def test_silu_mul_quant( 2025-05-07T20:31:39.2330246Z self, 2025-05-07T20:31:39.2330434Z T: int, 2025-05-07T20:31:39.2330631Z D: int, 2025-05-07T20:31:39.2330847Z scale_ub: Optional[float], 2025-05-07T20:31:39.2331115Z contiguous: bool, 2025-05-07T20:31:39.2331352Z compiled: bool, 2025-05-07T20:31:39.2331579Z ) -> None: 2025-05-07T20:31:39.2331786Z torch.manual_seed(2025) 2025-05-07T20:31:39.2332023Z 2025-05-07T20:31:39.2332295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.2332635Z 2025-05-07T20:31:39.2332837Z x_sign = torch.sign(x) 2025-05-07T20:31:39.2333136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.2333449Z x = x_sign * x_clamp 2025-05-07T20:31:39.2333679Z x0 = x[:, :D] 2025-05-07T20:31:39.2333893Z x1 = x[:, D:] 2025-05-07T20:31:39.2334101Z 2025-05-07T20:31:39.2334278Z if contiguous: 2025-05-07T20:31:39.2334526Z x0 = x0.contiguous() 2025-05-07T20:31:39.2334789Z x1 = x1.contiguous() 2025-05-07T20:31:39.2335030Z 2025-05-07T20:31:39.2335221Z if scale_ub is not None: 2025-05-07T20:31:39.2335489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.2335822Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.2336130Z ) 2025-05-07T20:31:39.2336324Z else: 2025-05-07T20:31:39.2336529Z scale_ub_tensor = None 2025-05-07T20:31:39.2336777Z 2025-05-07T20:31:39.2337010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.2337315Z op = silu_mul_quant 2025-05-07T20:31:39.2337564Z if compiled: 2025-05-07T20:31:39.2337812Z op = torch.compile(op) 2025-05-07T20:31:39.2338107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2338622Z 2025-05-07T20:31:39.2338821Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.2338984Z 2025-05-07T20:31:39.2339092Z moe/activation_test.py:117: 2025-05-07T20:31:39.2339509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2339840Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.2340119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2340676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.2341241Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.2341909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.2342603Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.2343143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.2343831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.2344508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.2345044Z kernel = self.compile( 2025-05-07T20:31:39.2345593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.2346257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.2346659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2346886Z 2025-05-07T20:31:39.2347097Z self = 2025-05-07T20:31:39.2348179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.2349665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb98a0>} 2025-05-07T20:31:39.2351022Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.2352053Z context = 2025-05-07T20:31:39.2352340Z 2025-05-07T20:31:39.2352509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.2353028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.2353492Z module_map=module_map) 2025-05-07T20:31:39.2353849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.2354204Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.2354469Z E ^ 2025-05-07T20:31:39.2354938Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.2355392Z 2025-05-07T20:31:39.2355809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.2356333Z 2025-05-07T20:31:39.3692723Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.3693604Z self=, 2025-05-07T20:31:39.3694398Z T=4096, 2025-05-07T20:31:39.3694768Z D=5120, 2025-05-07T20:31:39.3695137Z scale_ub=1200.0, 2025-05-07T20:31:39.3695575Z contiguous=True, 2025-05-07T20:31:39.3696004Z compiled=True, 2025-05-07T20:31:39.3696395Z ) 2025-05-07T20:31:39.3696994Z self = 2025-05-07T20:31:39.3697539Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:39.3697808Z 2025-05-07T20:31:39.3697883Z @given( 2025-05-07T20:31:39.3698121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.3698581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.3698881Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.3699210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.3699539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.3699821Z ) 2025-05-07T20:31:39.3701470Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.3701907Z def test_silu_mul_quant( 2025-05-07T20:31:39.3702145Z self, 2025-05-07T20:31:39.3702328Z T: int, 2025-05-07T20:31:39.3702527Z D: int, 2025-05-07T20:31:39.3702742Z scale_ub: Optional[float], 2025-05-07T20:31:39.3703004Z contiguous: bool, 2025-05-07T20:31:39.3703239Z compiled: bool, 2025-05-07T20:31:39.3703458Z ) -> None: 2025-05-07T20:31:39.3703667Z torch.manual_seed(2025) 2025-05-07T20:31:39.3703907Z 2025-05-07T20:31:39.3704191Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.3704529Z 2025-05-07T20:31:39.3704724Z x_sign = torch.sign(x) 2025-05-07T20:31:39.3705021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.3705325Z x = x_sign * x_clamp 2025-05-07T20:31:39.3705561Z x0 = x[:, :D] 2025-05-07T20:31:39.3705780Z x1 = x[:, D:] 2025-05-07T20:31:39.3705989Z 2025-05-07T20:31:39.3706164Z if contiguous: 2025-05-07T20:31:39.3706394Z x0 = x0.contiguous() 2025-05-07T20:31:39.3706682Z x1 = x1.contiguous() 2025-05-07T20:31:39.3706934Z 2025-05-07T20:31:39.3707121Z if scale_ub is not None: 2025-05-07T20:31:39.3707392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.3707719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.3708021Z ) 2025-05-07T20:31:39.3708213Z else: 2025-05-07T20:31:39.3708538Z scale_ub_tensor = None 2025-05-07T20:31:39.3708795Z 2025-05-07T20:31:39.3709026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.3709335Z op = silu_mul_quant 2025-05-07T20:31:39.3709584Z if compiled: 2025-05-07T20:31:39.3709828Z op = torch.compile(op) 2025-05-07T20:31:39.3710119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.3710393Z 2025-05-07T20:31:39.3710582Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.3712179Z 2025-05-07T20:31:39.3712284Z moe/activation_test.py:117: 2025-05-07T20:31:39.3712571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.3712904Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.3713184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.3713738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.3714302Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.3714966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.3715653Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.3716186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.3716867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.3717532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.3718058Z kernel = self.compile( 2025-05-07T20:31:39.3718601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.3719261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.3719661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.3719973Z 2025-05-07T20:31:39.3720183Z self = 2025-05-07T20:31:39.3721259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.3722617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdbaac0>} 2025-05-07T20:31:39.3724071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.3725097Z context = 2025-05-07T20:31:39.3725383Z 2025-05-07T20:31:39.3725554Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.3726078Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.3726546Z module_map=module_map) 2025-05-07T20:31:39.3726902Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.3727253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.3727508Z E ^ 2025-05-07T20:31:39.3727971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.3728421Z 2025-05-07T20:31:39.3728841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.3729359Z 2025-05-07T20:31:39.3729463Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.3729875Z self=, 2025-05-07T20:31:39.3730355Z T=128, 2025-05-07T20:31:39.3730547Z D=5120, 2025-05-07T20:31:39.3730729Z scale_ub=1200.0, 2025-05-07T20:31:39.3730942Z contiguous=False, 2025-05-07T20:31:39.3731165Z compiled=True, 2025-05-07T20:31:39.3731368Z ) 2025-05-07T20:31:39.4566518Z self = 2025-05-07T20:31:39.4567045Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:39.4567329Z 2025-05-07T20:31:39.4567409Z @given( 2025-05-07T20:31:39.4567644Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.4568056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.4568444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.4568780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.4569113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.4569395Z ) 2025-05-07T20:31:39.4569752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.4570202Z def test_silu_mul_quant( 2025-05-07T20:31:39.4570436Z self, 2025-05-07T20:31:39.4570635Z T: int, 2025-05-07T20:31:39.4570837Z D: int, 2025-05-07T20:31:39.4571052Z scale_ub: Optional[float], 2025-05-07T20:31:39.4571325Z contiguous: bool, 2025-05-07T20:31:39.4571565Z compiled: bool, 2025-05-07T20:31:39.4571786Z ) -> None: 2025-05-07T20:31:39.4572001Z torch.manual_seed(2025) 2025-05-07T20:31:39.4572242Z 2025-05-07T20:31:39.4572520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.4572860Z 2025-05-07T20:31:39.4573056Z x_sign = torch.sign(x) 2025-05-07T20:31:39.4573343Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.4580855Z x = x_sign * x_clamp 2025-05-07T20:31:39.4581151Z x0 = x[:, :D] 2025-05-07T20:31:39.4581373Z x1 = x[:, D:] 2025-05-07T20:31:39.4581588Z 2025-05-07T20:31:39.4581954Z if contiguous: 2025-05-07T20:31:39.4582196Z x0 = x0.contiguous() 2025-05-07T20:31:39.4582458Z x1 = x1.contiguous() 2025-05-07T20:31:39.4582696Z 2025-05-07T20:31:39.4582889Z if scale_ub is not None: 2025-05-07T20:31:39.4583160Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.4583496Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.4583838Z ) 2025-05-07T20:31:39.4584032Z else: 2025-05-07T20:31:39.4584243Z scale_ub_tensor = None 2025-05-07T20:31:39.4584499Z 2025-05-07T20:31:39.4584732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.4585049Z op = silu_mul_quant 2025-05-07T20:31:39.4585302Z if compiled: 2025-05-07T20:31:39.4585542Z op = torch.compile(op) 2025-05-07T20:31:39.4585840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4586116Z 2025-05-07T20:31:39.4586311Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.4586493Z 2025-05-07T20:31:39.4586596Z moe/activation_test.py:117: 2025-05-07T20:31:39.4586892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4587227Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.4587508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4588075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.4588643Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.4589298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.4589988Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.4590528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.4591341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.4592013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.4592546Z kernel = self.compile( 2025-05-07T20:31:39.4593096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.4593753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4594160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4594398Z 2025-05-07T20:31:39.4594608Z self = 2025-05-07T20:31:39.4595685Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.4597118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80c540>} 2025-05-07T20:31:39.4598461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.4599494Z context = 2025-05-07T20:31:39.4599787Z 2025-05-07T20:31:39.4599955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.4600480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4600947Z module_map=module_map) 2025-05-07T20:31:39.4601314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4601670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4601924Z E ^ 2025-05-07T20:31:39.4602487Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4602943Z 2025-05-07T20:31:39.4603537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.4604053Z 2025-05-07T20:31:39.4604164Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.4604568Z self=, 2025-05-07T20:31:39.4604969Z T=16384, 2025-05-07T20:31:39.4605162Z D=7168, 2025-05-07T20:31:39.4605351Z scale_ub=1200.0, 2025-05-07T20:31:39.4605577Z contiguous=True, 2025-05-07T20:31:39.4605804Z compiled=True, 2025-05-07T20:31:39.4606002Z ) 2025-05-07T20:31:39.4606316Z self = 2025-05-07T20:31:39.4606819Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:39.4607125Z 2025-05-07T20:31:39.4607222Z @given( 2025-05-07T20:31:39.4607460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.4607772Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.4608073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.4608389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.4608713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.4609000Z ) 2025-05-07T20:31:39.4609343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.4609784Z def test_silu_mul_quant( 2025-05-07T20:31:39.4610028Z self, 2025-05-07T20:31:39.4610225Z T: int, 2025-05-07T20:31:39.4610419Z D: int, 2025-05-07T20:31:39.4610636Z scale_ub: Optional[float], 2025-05-07T20:31:39.4610905Z contiguous: bool, 2025-05-07T20:31:39.4611137Z compiled: bool, 2025-05-07T20:31:39.4611360Z ) -> None: 2025-05-07T20:31:39.4611663Z torch.manual_seed(2025) 2025-05-07T20:31:39.4611910Z 2025-05-07T20:31:39.4612178Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.4612515Z 2025-05-07T20:31:39.4612713Z x_sign = torch.sign(x) 2025-05-07T20:31:39.4612995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.4613298Z x = x_sign * x_clamp 2025-05-07T20:31:39.4613534Z x0 = x[:, :D] 2025-05-07T20:31:39.4613741Z x1 = x[:, D:] 2025-05-07T20:31:39.4613945Z 2025-05-07T20:31:39.4614127Z if contiguous: 2025-05-07T20:31:39.4614350Z x0 = x0.contiguous() 2025-05-07T20:31:39.4614612Z x1 = x1.contiguous() 2025-05-07T20:31:39.4614847Z 2025-05-07T20:31:39.4615027Z if scale_ub is not None: 2025-05-07T20:31:39.4615299Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.4615634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.4615936Z ) 2025-05-07T20:31:39.4616142Z else: 2025-05-07T20:31:39.4616350Z scale_ub_tensor = None 2025-05-07T20:31:39.4616600Z 2025-05-07T20:31:39.4616830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.4617142Z op = silu_mul_quant 2025-05-07T20:31:39.4617385Z if compiled: 2025-05-07T20:31:39.4617626Z op = torch.compile(op) 2025-05-07T20:31:39.4617918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4618188Z 2025-05-07T20:31:39.4618371Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.4618538Z 2025-05-07T20:31:39.4618636Z moe/activation_test.py:117: 2025-05-07T20:31:39.4618927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4619247Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.4619531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4620093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.4620734Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.4621387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.4622075Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.4622616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.4623291Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.4623952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.4624487Z kernel = self.compile( 2025-05-07T20:31:39.4625028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.4625680Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4626083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4626315Z 2025-05-07T20:31:39.4626530Z self = 2025-05-07T20:31:39.4627651Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.4629006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80d080>} 2025-05-07T20:31:39.4630348Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.4631449Z context = 2025-05-07T20:31:39.4631742Z 2025-05-07T20:31:39.4631914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.4632427Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4632890Z module_map=module_map) 2025-05-07T20:31:39.4633251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4633607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4633860Z E ^ 2025-05-07T20:31:39.4634319Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4634770Z 2025-05-07T20:31:39.4635195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.4635709Z 2025-05-07T20:31:39.5594356Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.5594944Z self=, 2025-05-07T20:31:39.5595380Z T=16384, 2025-05-07T20:31:39.5595584Z D=5120, 2025-05-07T20:31:39.5595776Z scale_ub=1200.0, 2025-05-07T20:31:39.5595987Z contiguous=True, 2025-05-07T20:31:39.5596205Z compiled=False, 2025-05-07T20:31:39.5596411Z ) 2025-05-07T20:31:39.5596951Z self = 2025-05-07T20:31:39.5597946Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:39.5598501Z 2025-05-07T20:31:39.5598657Z @given( 2025-05-07T20:31:39.5599091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.5599713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.5600310Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.5600944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.5601586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.5602146Z ) 2025-05-07T20:31:39.5603137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.5604177Z def test_silu_mul_quant( 2025-05-07T20:31:39.5604646Z self, 2025-05-07T20:31:39.5605016Z T: int, 2025-05-07T20:31:39.5605394Z D: int, 2025-05-07T20:31:39.5605819Z scale_ub: Optional[float], 2025-05-07T20:31:39.5606348Z contiguous: bool, 2025-05-07T20:31:39.5606789Z compiled: bool, 2025-05-07T20:31:39.5607039Z ) -> None: 2025-05-07T20:31:39.5607283Z torch.manual_seed(2025) 2025-05-07T20:31:39.5607520Z 2025-05-07T20:31:39.5607790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.5608126Z 2025-05-07T20:31:39.5608316Z x_sign = torch.sign(x) 2025-05-07T20:31:39.5608609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.5608918Z x = x_sign * x_clamp 2025-05-07T20:31:39.5609153Z x0 = x[:, :D] 2025-05-07T20:31:39.5609363Z x1 = x[:, D:] 2025-05-07T20:31:39.5609575Z 2025-05-07T20:31:39.5609762Z if contiguous: 2025-05-07T20:31:39.5609989Z x0 = x0.contiguous() 2025-05-07T20:31:39.5610249Z x1 = x1.contiguous() 2025-05-07T20:31:39.5610488Z 2025-05-07T20:31:39.5610679Z if scale_ub is not None: 2025-05-07T20:31:39.5610952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.5611282Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.5611583Z ) 2025-05-07T20:31:39.5611775Z else: 2025-05-07T20:31:39.5611980Z scale_ub_tensor = None 2025-05-07T20:31:39.5612222Z 2025-05-07T20:31:39.5612452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5612762Z op = silu_mul_quant 2025-05-07T20:31:39.5613003Z if compiled: 2025-05-07T20:31:39.5613252Z op = torch.compile(op) 2025-05-07T20:31:39.5613697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5613978Z 2025-05-07T20:31:39.5614173Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.5614335Z 2025-05-07T20:31:39.5614438Z moe/activation_test.py:117: 2025-05-07T20:31:39.5614724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5615053Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.5615355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5616045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:39.5616735Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.5617272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.5617959Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.5618634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.5619167Z kernel = self.compile( 2025-05-07T20:31:39.5619714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.5620373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5620766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5620994Z 2025-05-07T20:31:39.5621199Z self = 2025-05-07T20:31:39.5622276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.5623643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80e660>} 2025-05-07T20:31:39.5625067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.5626094Z context = 2025-05-07T20:31:39.5626382Z 2025-05-07T20:31:39.5626550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.5627069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.5627536Z module_map=module_map) 2025-05-07T20:31:39.5627891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.5628243Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.5628503Z E ^ 2025-05-07T20:31:39.5628963Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.5629427Z 2025-05-07T20:31:39.5629846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.5630364Z 2025-05-07T20:31:39.5630468Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.5630882Z self=, 2025-05-07T20:31:39.5631274Z T=1, 2025-05-07T20:31:39.5631455Z D=7168, 2025-05-07T20:31:39.5631645Z scale_ub=1200.0, 2025-05-07T20:31:39.5631860Z contiguous=False, 2025-05-07T20:31:39.5632081Z compiled=False, 2025-05-07T20:31:39.5632283Z ) 2025-05-07T20:31:39.5632598Z self = 2025-05-07T20:31:39.5633077Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:39.5633347Z 2025-05-07T20:31:39.5633424Z @given( 2025-05-07T20:31:39.5633649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.5634039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.5634346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.5634675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.5634996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.5635280Z ) 2025-05-07T20:31:39.5635628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.5636069Z def test_silu_mul_quant( 2025-05-07T20:31:39.5636302Z self, 2025-05-07T20:31:39.5636492Z T: int, 2025-05-07T20:31:39.5636683Z D: int, 2025-05-07T20:31:39.5636890Z scale_ub: Optional[float], 2025-05-07T20:31:39.5637158Z contiguous: bool, 2025-05-07T20:31:39.5637399Z compiled: bool, 2025-05-07T20:31:39.5637613Z ) -> None: 2025-05-07T20:31:39.5637827Z torch.manual_seed(2025) 2025-05-07T20:31:39.5638064Z 2025-05-07T20:31:39.5638341Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.5638918Z 2025-05-07T20:31:39.5639111Z x_sign = torch.sign(x) 2025-05-07T20:31:39.5639394Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.5639699Z x = x_sign * x_clamp 2025-05-07T20:31:39.5639937Z x0 = x[:, :D] 2025-05-07T20:31:39.5640144Z x1 = x[:, D:] 2025-05-07T20:31:39.5640349Z 2025-05-07T20:31:39.5640533Z if contiguous: 2025-05-07T20:31:39.5640757Z x0 = x0.contiguous() 2025-05-07T20:31:39.5641011Z x1 = x1.contiguous() 2025-05-07T20:31:39.5641250Z 2025-05-07T20:31:39.5641432Z if scale_ub is not None: 2025-05-07T20:31:39.5641704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.5642036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.5642348Z ) 2025-05-07T20:31:39.5642534Z else: 2025-05-07T20:31:39.5642742Z scale_ub_tensor = None 2025-05-07T20:31:39.5642998Z 2025-05-07T20:31:39.5643451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5643764Z op = silu_mul_quant 2025-05-07T20:31:39.5644018Z if compiled: 2025-05-07T20:31:39.5644257Z op = torch.compile(op) 2025-05-07T20:31:39.5644555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5644833Z 2025-05-07T20:31:39.5645022Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.5645189Z 2025-05-07T20:31:39.5645286Z moe/activation_test.py:117: 2025-05-07T20:31:39.5645575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5645908Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.5646182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5646872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.5647568Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.5648110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.5648804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.5649477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.5650018Z kernel = self.compile( 2025-05-07T20:31:39.5650559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.5651224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5651622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5651849Z 2025-05-07T20:31:39.5652061Z self = 2025-05-07T20:31:39.5653254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.5654632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80dd00>} 2025-05-07T20:31:39.5655988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.5657022Z context = 2025-05-07T20:31:39.5657310Z 2025-05-07T20:31:39.5657479Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.5658002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.5658474Z module_map=module_map) 2025-05-07T20:31:39.5658846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.5659193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.5659449Z E ^ 2025-05-07T20:31:39.5659914Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.5660367Z 2025-05-07T20:31:39.5660786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.5661304Z 2025-05-07T20:31:39.9321478Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.9322716Z self=, 2025-05-07T20:31:39.9323474Z T=4096, 2025-05-07T20:31:39.9323736Z D=7168, 2025-05-07T20:31:39.9323933Z scale_ub=1200.0, 2025-05-07T20:31:39.9324178Z contiguous=False, 2025-05-07T20:31:39.9324414Z compiled=True, 2025-05-07T20:31:39.9324627Z ) 2025-05-07T20:31:39.9324983Z self = 2025-05-07T20:31:39.9325881Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:39.9326163Z 2025-05-07T20:31:39.9326259Z @given( 2025-05-07T20:31:39.9326492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.9326822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.9327140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.9327470Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.9327808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.9328108Z ) 2025-05-07T20:31:39.9328462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.9328921Z def test_silu_mul_quant( 2025-05-07T20:31:39.9329174Z self, 2025-05-07T20:31:39.9329373Z T: int, 2025-05-07T20:31:39.9329583Z D: int, 2025-05-07T20:31:39.9329819Z scale_ub: Optional[float], 2025-05-07T20:31:39.9330105Z contiguous: bool, 2025-05-07T20:31:39.9330359Z compiled: bool, 2025-05-07T20:31:39.9330603Z ) -> None: 2025-05-07T20:31:39.9330822Z torch.manual_seed(2025) 2025-05-07T20:31:39.9331074Z 2025-05-07T20:31:39.9331366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.9331722Z 2025-05-07T20:31:39.9331919Z x_sign = torch.sign(x) 2025-05-07T20:31:39.9332226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.9332550Z x = x_sign * x_clamp 2025-05-07T20:31:39.9332795Z x0 = x[:, :D] 2025-05-07T20:31:39.9333021Z x1 = x[:, D:] 2025-05-07T20:31:39.9333242Z 2025-05-07T20:31:39.9333434Z if contiguous: 2025-05-07T20:31:39.9333681Z x0 = x0.contiguous() 2025-05-07T20:31:39.9333955Z x1 = x1.contiguous() 2025-05-07T20:31:39.9334198Z 2025-05-07T20:31:39.9334565Z if scale_ub is not None: 2025-05-07T20:31:39.9334861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.9335200Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.9335519Z ) 2025-05-07T20:31:39.9335723Z else: 2025-05-07T20:31:39.9335936Z scale_ub_tensor = None 2025-05-07T20:31:39.9336202Z 2025-05-07T20:31:39.9336454Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.9336782Z op = silu_mul_quant 2025-05-07T20:31:39.9337035Z if compiled: 2025-05-07T20:31:39.9337293Z op = torch.compile(op) 2025-05-07T20:31:39.9337604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.9337881Z 2025-05-07T20:31:39.9338081Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.9338254Z 2025-05-07T20:31:39.9338726Z moe/activation_test.py:117: 2025-05-07T20:31:39.9339069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.9339426Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.9339722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.9340289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.9340864Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
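Each allocation size reported by these OOMs matches exactly one [T, 2 * D] bfloat16 intermediate from the test body (x, x_sign, and x_clamp all have that shape), while roughly 21.6 GiB is already held before the example starts, so the pressure comes from memory accumulated across earlier examples rather than from the example itself. A quick check of the arithmetic, with a hypothetical helper that is not part of the test suite:

    # One [T, 2*D] bfloat16 tensor, in MiB (2 bytes per element).
    def intermediate_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 2**20

    assert intermediate_mib(16384, 5120) == 320.0  # the 320.00 MiB failure above
    assert intermediate_mib(4096, 7168) == 112.0   # the 112.00 MiB failures below
    assert intermediate_mib(16384, 7168) == 448.0  # the 448.00 MiB failure below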
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError
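The allocator message itself names the mitigation. A sketch of the two standard knobs, assuming they can be applied in the test harness; the env var must be set before the first CUDA allocation:

    import gc
    import os

    # Per the hint in the error text; set before torch touches the GPU.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Between hypothesis examples: drop dead references, then return
        # cached blocks so a ~100 MiB request is not starved by a
        # fragmented 22 GiB cache.
        gc.collect()
        torch.cuda.empty_cache()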
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError
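For triage outside of hypothesis, the failing call can be reproduced directly. This sketch relies only on the import path and call signature visible in the traceback; whether silu_mul_quant is importable this way is an assumption:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Mirrors the T=128, D=5120, scale_ub=None example above; on an SM < 8.9
    # GPU this should raise the same fp8e4nv CompilationError.
    x0 = torch.randn([128, 5120], device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn([128, 5120], device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)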
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.3914144Z 2025-05-07T20:31:40.3914259Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.3914469Z 2025-05-07T20:31:40.3914574Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.3914975Z self=, 2025-05-07T20:31:40.3915364Z T=4096, 2025-05-07T20:31:40.3915542Z D=7168, 2025-05-07T20:31:40.3915727Z scale_ub=1200.0, 2025-05-07T20:31:40.3915949Z contiguous=True, 2025-05-07T20:31:40.3916250Z compiled=False, 2025-05-07T20:31:40.3916447Z ) 2025-05-07T20:31:40.4824645Z self = 2025-05-07T20:31:40.4825255Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:40.4825659Z 2025-05-07T20:31:40.4825767Z @given( 2025-05-07T20:31:40.4826077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4826501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4826913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4827354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4827778Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4828073Z ) 2025-05-07T20:31:40.4828419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4828856Z def test_silu_mul_quant( 2025-05-07T20:31:40.4829089Z self, 2025-05-07T20:31:40.4829292Z T: int, 2025-05-07T20:31:40.4829484Z D: int, 2025-05-07T20:31:40.4829691Z scale_ub: Optional[float], 2025-05-07T20:31:40.4829960Z contiguous: bool, 2025-05-07T20:31:40.4830197Z compiled: bool, 2025-05-07T20:31:40.4830413Z ) -> None: 2025-05-07T20:31:40.4830624Z torch.manual_seed(2025) 2025-05-07T20:31:40.4830862Z 2025-05-07T20:31:40.4831132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4833337Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4835211Z 2025-05-07T20:31:40.4835327Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4835537Z 2025-05-07T20:31:40.4835646Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4836052Z self=, 2025-05-07T20:31:40.4836457Z T=16384, 2025-05-07T20:31:40.4836649Z D=7168, 2025-05-07T20:31:40.4836840Z scale_ub=None, 2025-05-07T20:31:40.4837048Z contiguous=False, 2025-05-07T20:31:40.4837270Z compiled=True, 2025-05-07T20:31:40.4837476Z ) 2025-05-07T20:31:40.4837784Z self = 2025-05-07T20:31:40.4838283Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:40.4838763Z 2025-05-07T20:31:40.4838848Z @given( 2025-05-07T20:31:40.4839071Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4839394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4839697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4840014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4840339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4840618Z ) 2025-05-07T20:31:40.4840966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4841395Z def test_silu_mul_quant( 2025-05-07T20:31:40.4841638Z self, 2025-05-07T20:31:40.4841827Z T: int, 2025-05-07T20:31:40.4842019Z D: int, 2025-05-07T20:31:40.4842230Z scale_ub: Optional[float], 2025-05-07T20:31:40.4842498Z contiguous: bool, 2025-05-07T20:31:40.4842728Z compiled: bool, 2025-05-07T20:31:40.4842945Z ) -> None: 2025-05-07T20:31:40.4843159Z torch.manual_seed(2025) 2025-05-07T20:31:40.4843496Z 2025-05-07T20:31:40.4843776Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4845979Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4847830Z 2025-05-07T20:31:40.4847947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4848156Z 2025-05-07T20:31:40.4848259Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4848661Z self=, 2025-05-07T20:31:40.4849062Z T=4096, 2025-05-07T20:31:40.4849247Z D=7168, 2025-05-07T20:31:40.4849440Z scale_ub=None, 2025-05-07T20:31:40.4849651Z contiguous=True, 2025-05-07T20:31:40.4849879Z compiled=False, 2025-05-07T20:31:40.4850077Z ) 2025-05-07T20:31:40.4850393Z self = 2025-05-07T20:31:40.4850879Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:40.4851147Z 2025-05-07T20:31:40.4851229Z @given( 2025-05-07T20:31:40.4851449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4851751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4852052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4852374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4852707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4852987Z ) 2025-05-07T20:31:40.4853325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4853877Z def test_silu_mul_quant( 2025-05-07T20:31:40.4854125Z self, 2025-05-07T20:31:40.4854310Z T: int, 2025-05-07T20:31:40.4854502Z D: int, 2025-05-07T20:31:40.4854720Z scale_ub: Optional[float], 2025-05-07T20:31:40.4854981Z contiguous: bool, 2025-05-07T20:31:40.4855222Z compiled: bool, 2025-05-07T20:31:40.4855468Z ) -> None: 2025-05-07T20:31:40.4855682Z torch.manual_seed(2025) 2025-05-07T20:31:40.4855919Z 2025-05-07T20:31:40.4856187Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4858288Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4860148Z 2025-05-07T20:31:40.4860263Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4860477Z 2025-05-07T20:31:40.4860582Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4860982Z self=, 2025-05-07T20:31:40.4861379Z T=16384, 2025-05-07T20:31:40.4861566Z D=7168, 2025-05-07T20:31:40.4861751Z scale_ub=None, 2025-05-07T20:31:40.4861959Z contiguous=True, 2025-05-07T20:31:40.4862174Z compiled=False, 2025-05-07T20:31:40.4862372Z ) 2025-05-07T20:31:40.4862682Z self = 2025-05-07T20:31:40.4863176Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:40.4863449Z 2025-05-07T20:31:40.4863530Z @given( 2025-05-07T20:31:40.4863753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4864151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4864449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4864768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4865095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4865374Z ) 2025-05-07T20:31:40.4865718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4866154Z def test_silu_mul_quant( 2025-05-07T20:31:40.4866394Z self, 2025-05-07T20:31:40.4866587Z T: int, 2025-05-07T20:31:40.4866772Z D: int, 2025-05-07T20:31:40.4867012Z scale_ub: Optional[float], 2025-05-07T20:31:40.4867301Z contiguous: bool, 2025-05-07T20:31:40.4867529Z compiled: bool, 2025-05-07T20:31:40.4867745Z ) -> None: 2025-05-07T20:31:40.4867956Z torch.manual_seed(2025) 2025-05-07T20:31:40.4868189Z 2025-05-07T20:31:40.4868467Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4870497Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4872344Z 2025-05-07T20:31:40.4872464Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4872672Z 2025-05-07T20:31:40.4872777Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4873181Z self=, 2025-05-07T20:31:40.4873666Z T=16384, 2025-05-07T20:31:40.4873858Z D=7168, 2025-05-07T20:31:40.4874041Z scale_ub=1200.0, 2025-05-07T20:31:40.4874262Z contiguous=True, 2025-05-07T20:31:40.4874483Z compiled=False, 2025-05-07T20:31:40.4874681Z ) 2025-05-07T20:31:40.4874998Z self = 2025-05-07T20:31:40.4875496Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:40.4875775Z 2025-05-07T20:31:40.4875852Z @given( 2025-05-07T20:31:40.4876079Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4876385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4876689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4877038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4877388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4877668Z ) 2025-05-07T20:31:40.4878014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4878453Z def test_silu_mul_quant( 2025-05-07T20:31:40.4878685Z self, 2025-05-07T20:31:40.4878872Z T: int, 2025-05-07T20:31:40.4879065Z D: int, 2025-05-07T20:31:40.4879275Z scale_ub: Optional[float], 2025-05-07T20:31:40.4879537Z contiguous: bool, 2025-05-07T20:31:40.4879771Z compiled: bool, 2025-05-07T20:31:40.4879994Z ) -> None: 2025-05-07T20:31:40.4880202Z torch.manual_seed(2025) 2025-05-07T20:31:40.4880440Z 2025-05-07T20:31:40.4880709Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4882745Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
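The repeated OutOfMemoryError above follows the allocator's own hint: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before the process makes its first CUDA allocation. A minimal sketch, assuming the environment can be set before torch initializes CUDA and that a per-example cleanup hook is acceptable; the class name and setUp hook are illustrative and not part of activation_test.py:

import os

# Must be set before the first CUDA allocation in the process; the value
# is taken verbatim from the allocator's error message above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import unittest

import torch


class ActivationTestsWithCleanup(unittest.TestCase):  # hypothetical name
    def setUp(self) -> None:
        # Return cached-but-unallocated blocks to the driver so the next
        # example's 20-448 MiB torch.randn request has headroom on the
        # ~22 GiB device described in the log.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()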
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4884772Z 2025-05-07T20:31:40.4884894Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4885103Z 2025-05-07T20:31:40.4885203Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4885610Z self=, 2025-05-07T20:31:40.4886006Z T=128, 2025-05-07T20:31:40.4886184Z D=5120, 2025-05-07T20:31:40.4886371Z scale_ub=1200.0, 2025-05-07T20:31:40.4886591Z contiguous=False, 2025-05-07T20:31:40.4886807Z compiled=False, 2025-05-07T20:31:40.4887000Z ) 2025-05-07T20:31:40.5922556Z self = 2025-05-07T20:31:40.5923366Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:40.5923789Z 2025-05-07T20:31:40.5923905Z @given( 2025-05-07T20:31:40.5924225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.5924662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.5925076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.5925526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.5925932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.5926217Z ) 2025-05-07T20:31:40.5926559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.5927023Z def test_silu_mul_quant( 2025-05-07T20:31:40.5927291Z self, 2025-05-07T20:31:40.5927478Z T: int, 2025-05-07T20:31:40.5927670Z D: int, 2025-05-07T20:31:40.5927881Z scale_ub: Optional[float], 2025-05-07T20:31:40.5928145Z contiguous: bool, 2025-05-07T20:31:40.5928380Z compiled: bool, 2025-05-07T20:31:40.5928597Z ) -> None: 2025-05-07T20:31:40.5929039Z torch.manual_seed(2025) 2025-05-07T20:31:40.5929281Z 2025-05-07T20:31:40.5929548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.5929880Z 2025-05-07T20:31:40.5930066Z x_sign = torch.sign(x) 2025-05-07T20:31:40.5930371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.5930673Z x = x_sign * x_clamp 2025-05-07T20:31:40.5930911Z x0 = x[:, :D] 2025-05-07T20:31:40.5931118Z x1 = x[:, D:] 2025-05-07T20:31:40.5931318Z 2025-05-07T20:31:40.5931503Z if contiguous: 2025-05-07T20:31:40.5931727Z x0 = x0.contiguous() 2025-05-07T20:31:40.5931984Z x1 = x1.contiguous() 2025-05-07T20:31:40.5932220Z 2025-05-07T20:31:40.5932402Z if scale_ub is not None: 2025-05-07T20:31:40.5932671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:40.5933002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:40.5933309Z ) 2025-05-07T20:31:40.5933505Z else: 2025-05-07T20:31:40.5933708Z scale_ub_tensor = None 2025-05-07T20:31:40.5933949Z 2025-05-07T20:31:40.5934177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.5934486Z op = silu_mul_quant 2025-05-07T20:31:40.5934740Z if compiled: 2025-05-07T20:31:40.5934984Z op = torch.compile(op) 2025-05-07T20:31:40.5935275Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.5935545Z 2025-05-07T20:31:40.5935727Z > y_fp8, y_scale = fn() 2025-05-07T20:31:40.5935893Z 2025-05-07T20:31:40.5935990Z moe/activation_test.py:117: 2025-05-07T20:31:40.5936279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.5936607Z moe/activation_test.py:115: in fn 2025-05-07T20:31:40.5936895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.5937597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:40.5938651Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:40.5939237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:40.5939918Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:40.5940588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:40.5941119Z kernel = self.compile( 2025-05-07T20:31:40.5941660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:40.5942316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.5942705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.5942932Z 2025-05-07T20:31:40.5943144Z self = 2025-05-07T20:31:40.5944228Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:40.5945587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150220>} 2025-05-07T20:31:40.5946934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.5947964Z context = 2025-05-07T20:31:40.5948253Z 2025-05-07T20:31:40.5948422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.5949256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.5957414Z module_map=module_map) 2025-05-07T20:31:40.5957788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.5958139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.5958393Z E ^ 2025-05-07T20:31:40.5958860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.5959311Z 2025-05-07T20:31:40.5959734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.5960253Z 2025-05-07T20:31:40.5960354Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.5960765Z self=, 2025-05-07T20:31:40.5961167Z T=2048, 2025-05-07T20:31:40.5961347Z D=7168, 2025-05-07T20:31:40.5961538Z scale_ub=None, 2025-05-07T20:31:40.5961757Z contiguous=False, 2025-05-07T20:31:40.5961970Z compiled=False, 2025-05-07T20:31:40.5962172Z ) 2025-05-07T20:31:40.5962484Z self = 2025-05-07T20:31:40.5962967Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:40.5963349Z 2025-05-07T20:31:40.5963426Z @given( 2025-05-07T20:31:40.5963656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.5963966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.5964265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.5964591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.5964917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.5965192Z ) 2025-05-07T20:31:40.5965539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.5965977Z def test_silu_mul_quant( 2025-05-07T20:31:40.5966380Z self, 2025-05-07T20:31:40.5966565Z T: int, 2025-05-07T20:31:40.5966755Z D: int, 2025-05-07T20:31:40.5966962Z scale_ub: Optional[float], 2025-05-07T20:31:40.5967229Z contiguous: bool, 2025-05-07T20:31:40.5967469Z compiled: bool, 2025-05-07T20:31:40.5967693Z ) -> None: 2025-05-07T20:31:40.5967893Z torch.manual_seed(2025) 2025-05-07T20:31:40.5968127Z 2025-05-07T20:31:40.5968392Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.5970433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
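The CompilationError above is an architecture limit rather than a bad test input: Triton accepts fp8e4nv (e4m3) only on SM 8.9 and newer parts, and the supported list it reports here, ('fp8e4b15', 'fp8e5'), is what it offers on earlier GPUs, so this runner's device evidently predates SM 8.9. A hedged sketch of a capability guard; the helper name, container class, and skip message are illustrative, not the test file's actual gating:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv maps to e4m3; recent Triton releases accept it on SM 8.9+
    # GPUs, which is why this log's GPU reports only ('fp8e4b15', 'fp8e5').
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv needs SM 8.9+")
class Fp8KernelTests(unittest.TestCase):  # hypothetical container
    pass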
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.5972289Z 2025-05-07T20:31:40.5972405Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.5972618Z 2025-05-07T20:31:40.5972719Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.5973126Z self=, 2025-05-07T20:31:40.5973522Z T=128, 2025-05-07T20:31:40.5973699Z D=7168, 2025-05-07T20:31:40.5973884Z scale_ub=1200.0, 2025-05-07T20:31:40.5974103Z contiguous=True, 2025-05-07T20:31:40.5974313Z compiled=True, 2025-05-07T20:31:40.5974510Z ) 2025-05-07T20:31:40.6277977Z self = 2025-05-07T20:31:40.6279451Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:40.6280190Z 2025-05-07T20:31:40.6280402Z @given( 2025-05-07T20:31:40.6281372Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.6282017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.6282606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.6283404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.6284035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.6284591Z ) 2025-05-07T20:31:40.6285272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.6286126Z def test_silu_mul_quant( 2025-05-07T20:31:40.6286591Z self, 2025-05-07T20:31:40.6286884Z T: int, 2025-05-07T20:31:40.6287073Z D: int, 2025-05-07T20:31:40.6287288Z scale_ub: Optional[float], 2025-05-07T20:31:40.6287551Z contiguous: bool, 2025-05-07T20:31:40.6287783Z compiled: bool, 2025-05-07T20:31:40.6288010Z ) -> None: 2025-05-07T20:31:40.6288225Z torch.manual_seed(2025) 2025-05-07T20:31:40.6288455Z 2025-05-07T20:31:40.6288730Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.6289065Z 2025-05-07T20:31:40.6289253Z x_sign = torch.sign(x) 2025-05-07T20:31:40.6289539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.6289838Z x = x_sign * x_clamp 2025-05-07T20:31:40.6290072Z x0 = x[:, :D] 2025-05-07T20:31:40.6290285Z x1 = x[:, D:] 2025-05-07T20:31:40.6290483Z 2025-05-07T20:31:40.6290666Z if contiguous: 2025-05-07T20:31:40.6290893Z x0 = x0.contiguous() 2025-05-07T20:31:40.6291136Z x1 = x1.contiguous() 2025-05-07T20:31:40.6291374Z 2025-05-07T20:31:40.6291558Z if scale_ub is not None: 2025-05-07T20:31:40.6291824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:40.6292151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:40.6292460Z ) 2025-05-07T20:31:40.6292647Z else: 2025-05-07T20:31:40.6292847Z scale_ub_tensor = None 2025-05-07T20:31:40.6293226Z 2025-05-07T20:31:40.6293456Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.6293762Z op = silu_mul_quant 2025-05-07T20:31:40.6294006Z if compiled: 2025-05-07T20:31:40.6294253Z op = torch.compile(op) 2025-05-07T20:31:40.6294541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.6294808Z 2025-05-07T20:31:40.6294994Z > y_fp8, y_scale = fn() 2025-05-07T20:31:40.6295156Z 2025-05-07T20:31:40.6295255Z moe/activation_test.py:117: 2025-05-07T20:31:40.6295544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.6295868Z moe/activation_test.py:115: in fn 2025-05-07T20:31:40.6296139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.6296692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:40.6297247Z return fn(*args, **kwargs) 
2025-05-07T20:31:40.6297909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:40.6298592Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:40.6299131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:40.6299806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:40.6300463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:40.6300987Z kernel = self.compile( 2025-05-07T20:31:40.6301527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:40.6302186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.6302573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.6302884Z 2025-05-07T20:31:40.6303098Z self = 2025-05-07T20:31:40.6304176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:40.6305540Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150860>} 2025-05-07T20:31:40.6306873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.6307888Z context = 2025-05-07T20:31:40.6308178Z 2025-05-07T20:31:40.6308348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.6308868Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.6309326Z module_map=module_map) 2025-05-07T20:31:40.6309680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.6310028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.6310286Z E ^ 2025-05-07T20:31:40.6310746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.6311195Z 2025-05-07T20:31:40.6311613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.6312130Z 2025-05-07T20:31:40.6312233Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.6312639Z self=, 2025-05-07T20:31:40.6313032Z T=128, 2025-05-07T20:31:40.6313216Z D=7168, 2025-05-07T20:31:40.6313486Z scale_ub=1200.0, 2025-05-07T20:31:40.6313700Z contiguous=True, 2025-05-07T20:31:40.6313917Z compiled=False, 2025-05-07T20:31:40.6314115Z ) 2025-05-07T20:31:40.6314426Z self = 2025-05-07T20:31:40.6314912Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:40.6315184Z 2025-05-07T20:31:40.6315257Z @given( 2025-05-07T20:31:40.6315481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.6315781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.6316080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.6316403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.6316722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.6317002Z ) 2025-05-07T20:31:40.6317343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.6317782Z def test_silu_mul_quant( 2025-05-07T20:31:40.6318015Z self, 2025-05-07T20:31:40.6318205Z T: int, 2025-05-07T20:31:40.6318392Z D: int, 2025-05-07T20:31:40.6318603Z scale_ub: Optional[float], 2025-05-07T20:31:40.6318862Z contiguous: bool, 2025-05-07T20:31:40.6319090Z compiled: bool, 2025-05-07T20:31:40.6319303Z ) -> None: 2025-05-07T20:31:40.6319518Z torch.manual_seed(2025) 2025-05-07T20:31:40.6319756Z 2025-05-07T20:31:40.6320015Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.6320354Z 2025-05-07T20:31:40.6320544Z x_sign = torch.sign(x) 2025-05-07T20:31:40.6320826Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.6322906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.6324844Z 2025-05-07T20:31:40.6324963Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:40.6325184Z 2025-05-07T20:31:40.6325284Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.6325699Z self=, 2025-05-07T20:31:40.6326097Z T=128, 2025-05-07T20:31:40.6326282Z D=5120, 2025-05-07T20:31:40.6326471Z scale_ub=1200.0, 2025-05-07T20:31:40.6326685Z contiguous=True, 2025-05-07T20:31:40.6326899Z compiled=True, 2025-05-07T20:31:40.6327100Z ) 2025-05-07T20:31:40.6327415Z self = 2025-05-07T20:31:40.6327897Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:40.6328160Z 2025-05-07T20:31:40.6328243Z @given( 2025-05-07T20:31:40.6328462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.6328766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.6329064Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.6329386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.6329703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.6329983Z ) 2025-05-07T20:31:40.6330324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.6330752Z def test_silu_mul_quant( 2025-05-07T20:31:40.6330989Z self, 2025-05-07T20:31:40.6331177Z T: int, 2025-05-07T20:31:40.6331362Z D: int, 2025-05-07T20:31:40.6331589Z scale_ub: Optional[float], 2025-05-07T20:31:40.6331864Z contiguous: bool, 2025-05-07T20:31:40.6332181Z compiled: bool, 2025-05-07T20:31:40.6332400Z ) -> None: 2025-05-07T20:31:40.6332614Z torch.manual_seed(2025) 2025-05-07T20:31:40.6332844Z 2025-05-07T20:31:40.6333109Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.6333445Z 2025-05-07T20:31:40.6333628Z > x_sign = torch.sign(x) 2025-05-07T20:31:40.6335559Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.6337461Z 2025-05-07T20:31:40.6337581Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:40.6337797Z 2025-05-07T20:31:40.6337898Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.6338305Z self=, 2025-05-07T20:31:40.6338868Z T=128, 2025-05-07T20:31:40.6339049Z D=7168, 2025-05-07T20:31:40.6339234Z scale_ub=None, 2025-05-07T20:31:40.6339437Z contiguous=True, 2025-05-07T20:31:40.6339649Z compiled=True, 2025-05-07T20:31:40.6339843Z ) 2025-05-07T20:31:41.1541744Z self = 2025-05-07T20:31:41.1542264Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.1542533Z 2025-05-07T20:31:41.1542608Z @given( 2025-05-07T20:31:41.1542834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1543145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.1543728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.1544070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.1544391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.1544676Z ) 2025-05-07T20:31:41.1545022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.1545454Z def test_silu_mul_quant( 2025-05-07T20:31:41.1545698Z self, 2025-05-07T20:31:41.1545901Z T: int, 2025-05-07T20:31:41.1546094Z D: int, 2025-05-07T20:31:41.1546302Z scale_ub: Optional[float], 2025-05-07T20:31:41.1546569Z contiguous: bool, 2025-05-07T20:31:41.1546807Z compiled: bool, 2025-05-07T20:31:41.1547029Z ) -> None: 2025-05-07T20:31:41.1547240Z torch.manual_seed(2025) 2025-05-07T20:31:41.1547479Z 2025-05-07T20:31:41.1547744Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1549833Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1551679Z 2025-05-07T20:31:41.1551793Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.1552009Z 2025-05-07T20:31:41.1608834Z FAILED 2025-05-07T20:31:41.1609079Z 2025-05-07T20:31:41.1609362Z =================================== FAILURES =================================== 2025-05-07T20:31:41.1609822Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:41.1610274Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:41.1611087Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:31:41.1611728Z | yield 2025-05-07T20:31:41.1612168Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:31:41.1612686Z | self._callTestMethod(testMethod) 2025-05-07T20:31:41.1613252Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:31:41.1613796Z | if method() is not None: 2025-05-07T20:31:41.1614041Z | ^^^^^^^^ 2025-05-07T20:31:41.1614684Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:41.1615409Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1615708Z | ^^^^^^^ 2025-05-07T20:31:41.1616270Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:41.1616896Z | raise the_error_hypothesis_found 2025-05-07T20:31:41.1617329Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:41.1617753Z +-+---------------- 1 ---------------- 2025-05-07T20:31:41.1618038Z | Traceback (most recent call last): 2025-05-07T20:31:41.1618750Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:41.1619533Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1619900Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1621976Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1624183Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1624623Z | self=, 2025-05-07T20:31:41.1625028Z | T=128, 2025-05-07T20:31:41.1625251Z | D=7168, 2025-05-07T20:31:41.1625473Z | scale_ub=1200.0, 2025-05-07T20:31:41.1625733Z | contiguous=True, 2025-05-07T20:31:41.1625979Z | compiled=False, 2025-05-07T20:31:41.1626200Z | ) 2025-05-07T20:31:41.1626380Z | 2025-05-07T20:31:41.1626913Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:41.1627511Z +---------------- 2 ---------------- 2025-05-07T20:31:41.1627797Z | Traceback (most recent call last): 2025-05-07T20:31:41.1628504Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:41.1629274Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1629649Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1631623Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1633664Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1634098Z | self=, 2025-05-07T20:31:41.1634500Z | T=128, 2025-05-07T20:31:41.1634699Z | D=7168, 2025-05-07T20:31:41.1634912Z | scale_ub=None, 2025-05-07T20:31:41.1635137Z | contiguous=True, 2025-05-07T20:31:41.1635372Z | compiled=True, 2025-05-07T20:31:41.1635589Z | ) 2025-05-07T20:31:41.1635758Z | 2025-05-07T20:31:41.1636274Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:41.1636868Z +---------------- 3 ---------------- 2025-05-07T20:31:41.1637154Z | Traceback (most recent call last): 2025-05-07T20:31:41.1637866Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:41.1638793Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1639165Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1641136Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
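Each falsifying example above comes with a @reproduce_failure blob. Replaying one locally means pinning the decorator printed in the log on top of an otherwise unchanged test; a sketch with a stand-in for the module's private constant (the strategies are copied from the log, the blob is the one printed for sub-exception 1, and the installed Hypothesis must match version 6.131.14 for the payload to decode):

from typing import Optional

from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

_MAX_SAMPLES = 100  # stand-in; the real constant lives in activation_test.py


class ReplayActivationTests:  # illustrative shell around the real test
    @reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")  # blob from the log above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        pass  # the real body is in moe/activation_test.py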
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1643216Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1643759Z | self=, 2025-05-07T20:31:41.1644162Z | T=128, 2025-05-07T20:31:41.1644361Z | D=5120, 2025-05-07T20:31:41.1644566Z | scale_ub=1200.0, 2025-05-07T20:31:41.1644803Z | contiguous=True, 2025-05-07T20:31:41.1645036Z | compiled=True, 2025-05-07T20:31:41.1645249Z | ) 2025-05-07T20:31:41.1645423Z | 2025-05-07T20:31:41.1645936Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:41.1646527Z +---------------- 4 ---------------- 2025-05-07T20:31:41.1646856Z | Traceback (most recent call last): 2025-05-07T20:31:41.1647603Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:41.1648324Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.1648616Z | ^^^^^^^^ 2025-05-07T20:31:41.1649251Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:41.1649959Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.1650308Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1651148Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:41.1651940Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.1652553Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:41.1653305Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.1653900Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1654545Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:41.1655333Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1655811Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1656476Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:41.1657287Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1657750Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1658397Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:41.1659108Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.1659474Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1660081Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:41.1660648Z | fn() 2025-05-07T20:31:41.1661212Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:41.1661874Z | self.fn.run( 2025-05-07T20:31:41.1662402Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:41.1681483Z | kernel = self.compile( 2025-05-07T20:31:41.1685318Z | ^^^^^^^^^^^^^ 2025-05-07T20:31:41.1686492Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:41.1687677Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.1688261Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1689170Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:41.1690285Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.1690955Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1691476Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.1691962Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.1692330Z | ^ 2025-05-07T20:31:41.1692971Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.1693772Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1694351Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:41.1695086Z | self=, 2025-05-07T20:31:41.1695689Z | T=1, # or any other generated value 2025-05-07T20:31:41.1696128Z | D=5120, # or any other generated value 2025-05-07T20:31:41.1696608Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:41.1697103Z | contiguous=True, # or any other generated value 2025-05-07T20:31:41.1697608Z | compiled=True, # or any other generated value 2025-05-07T20:31:41.1698029Z | ) 2025-05-07T20:31:41.1698270Z | 2025-05-07T20:31:41.1699005Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:41.1699975Z +------------------------------------ 2025-05-07T20:31:41.1700481Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:41.1700991Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.1701569Z self=, 2025-05-07T20:31:41.1702128Z T=1, 2025-05-07T20:31:41.1702375Z D=5120, 2025-05-07T20:31:41.1702641Z scale_ub=None, 2025-05-07T20:31:41.1702939Z contiguous=True, 2025-05-07T20:31:41.1703248Z compiled=True, 2025-05-07T20:31:41.1703539Z ) 2025-05-07T20:31:41.1703984Z self = 2025-05-07T20:31:41.1704656Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.1705018Z 2025-05-07T20:31:41.1705126Z @given( 2025-05-07T20:31:41.1705448Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1705904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.1706342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.1706811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.1707296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.1707703Z ) 2025-05-07T20:31:41.1708203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.1708823Z def test_silu_mul_quant( 2025-05-07T20:31:41.1709157Z self, 2025-05-07T20:31:41.1709439Z T: int, 2025-05-07T20:31:41.1709726Z D: int, 2025-05-07T20:31:41.1710029Z scale_ub: Optional[float], 2025-05-07T20:31:41.1710401Z contiguous: 
bool, 2025-05-07T20:31:41.1710745Z compiled: bool, 2025-05-07T20:31:41.1711073Z ) -> None: 2025-05-07T20:31:41.1711382Z torch.manual_seed(2025) 2025-05-07T20:31:41.1711735Z 2025-05-07T20:31:41.1712116Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1712690Z 2025-05-07T20:31:41.1712965Z x_sign = torch.sign(x) 2025-05-07T20:31:41.1713381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.1713806Z x = x_sign * x_clamp 2025-05-07T20:31:41.1714144Z x0 = x[:, :D] 2025-05-07T20:31:41.1714450Z x1 = x[:, D:] 2025-05-07T20:31:41.1714750Z 2025-05-07T20:31:41.1715020Z if contiguous: 2025-05-07T20:31:41.1715347Z x0 = x0.contiguous() 2025-05-07T20:31:41.1715711Z x1 = x1.contiguous() 2025-05-07T20:31:41.1716066Z 2025-05-07T20:31:41.1716340Z if scale_ub is not None: 2025-05-07T20:31:41.1716718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.1717241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.1717675Z ) 2025-05-07T20:31:41.1717939Z else: 2025-05-07T20:31:41.1718222Z scale_ub_tensor = None 2025-05-07T20:31:41.1718590Z 2025-05-07T20:31:41.1718927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.1719380Z op = silu_mul_quant 2025-05-07T20:31:41.1719740Z if compiled: 2025-05-07T20:31:41.1720091Z op = torch.compile(op) 2025-05-07T20:31:41.1720511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.1720891Z 2025-05-07T20:31:41.1721156Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.1721560Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.1721969Z 2025-05-07T20:31:41.1722292Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.1722749Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.1723159Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.1723756Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.1724247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.1724679Z 2025-05-07T20:31:41.1724965Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.1725339Z 2025-05-07T20:31:41.1725482Z moe/activation_test.py:126: 2025-05-07T20:31:41.1725895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1726359Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.1726804Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.1727895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.1728943Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.1729690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.1730632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.1731587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.1732597Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1733585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.1734650Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1735627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.1736543Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.1737385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.1738074Z fn() 2025-05-07T20:31:41.1739853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.1740659Z self.fn.run( 2025-05-07T20:31:41.1741530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.1742294Z kernel = self.compile( 2025-05-07T20:31:41.1743026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.1743902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.1744418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1744734Z 2025-05-07T20:31:41.1745007Z self = 2025-05-07T20:31:41.1746459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.1748335Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369c3060>} 2025-05-07T20:31:41.1750142Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.1751510Z context = 2025-05-07T20:31:41.1751894Z 2025-05-07T20:31:41.1752113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.1752816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.1753427Z module_map=module_map) 2025-05-07T20:31:41.1753858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.1754287Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.1754609Z E ^ 2025-05-07T20:31:41.1755177Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.1758842Z 2025-05-07T20:31:41.1759364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.1760012Z 2025-05-07T20:31:41.1760136Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.1760637Z self=, 2025-05-07T20:31:41.1761125Z T=2048, 2025-05-07T20:31:41.1761351Z D=5120, 2025-05-07T20:31:41.1761581Z scale_ub=1200.0, 2025-05-07T20:31:41.1761841Z contiguous=True, 2025-05-07T20:31:41.1762112Z compiled=False, 2025-05-07T20:31:41.1762363Z ) 2025-05-07T20:31:41.1762744Z self = 2025-05-07T20:31:41.1763527Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.1763869Z 2025-05-07T20:31:41.1763969Z @given( 2025-05-07T20:31:41.1764251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1764680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.1765100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.1765538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.1765950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.1766295Z ) 2025-05-07T20:31:41.1766722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.1767268Z def test_silu_mul_quant( 2025-05-07T20:31:41.1767565Z self, 2025-05-07T20:31:41.1767795Z T: int, 2025-05-07T20:31:41.1768021Z D: int, 2025-05-07T20:31:41.1768283Z scale_ub: Optional[float], 2025-05-07T20:31:41.1768618Z contiguous: bool, 2025-05-07T20:31:41.1768916Z compiled: bool, 2025-05-07T20:31:41.1769187Z ) -> None: 2025-05-07T20:31:41.1769450Z torch.manual_seed(2025) 2025-05-07T20:31:41.1769839Z 2025-05-07T20:31:41.1770173Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1770590Z 2025-05-07T20:31:41.1770826Z x_sign = torch.sign(x) 2025-05-07T20:31:41.1771174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.1771551Z x = x_sign * x_clamp 2025-05-07T20:31:41.1771839Z x0 = x[:, :D] 2025-05-07T20:31:41.1772110Z x1 = x[:, D:] 2025-05-07T20:31:41.1772382Z 2025-05-07T20:31:41.1772611Z if contiguous: 2025-05-07T20:31:41.1772889Z x0 = x0.contiguous() 2025-05-07T20:31:41.1773201Z x1 = x1.contiguous() 2025-05-07T20:31:41.1773499Z 2025-05-07T20:31:41.1773721Z if scale_ub is not None: 2025-05-07T20:31:41.1774075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.1774500Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.1774869Z ) 2025-05-07T20:31:41.1775099Z else: 2025-05-07T20:31:41.1775369Z scale_ub_tensor = None 2025-05-07T20:31:41.1775692Z 2025-05-07T20:31:41.1775973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.1776357Z op = silu_mul_quant 2025-05-07T20:31:41.1776659Z if compiled: 2025-05-07T20:31:41.1776948Z op = torch.compile(op) 2025-05-07T20:31:41.1777307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.1777651Z 2025-05-07T20:31:41.1777883Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.1778093Z 2025-05-07T20:31:41.1778223Z moe/activation_test.py:117: 2025-05-07T20:31:41.1778632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1779090Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.1779480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.1780444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.1781470Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.1782191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.1783121Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.1784033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.1784765Z kernel = self.compile( 2025-05-07T20:31:41.1785520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.1786457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.1787013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1787336Z 2025-05-07T20:31:41.1787612Z self = 2025-05-07T20:31:41.1789107Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.1791016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369deac0>} 2025-05-07T20:31:41.1792900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.1794331Z context = 2025-05-07T20:31:41.1794720Z 2025-05-07T20:31:41.1794938Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.1795758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.1796407Z module_map=module_map) 2025-05-07T20:31:41.1796909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.1797411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.1797779Z E ^ 2025-05-07T20:31:41.1798410Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

[test body identical to the example above; duplicate source elided. With compiled=True the call y_fp8, y_scale = fn() succeeds and the failure moves to the reference path:]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
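All of the failures above share one root cause: Triton's NVIDIA backend lowers torch.float8_e4m3fn to its fp8e4nv dtype, which it only accepts on GPUs of compute capability 8.9 and newer (Ada/Hopper), while the A10G in this linux.g5.4xlarge runner reports capability 8.6, where only fp8e4b15 and fp8e5 exist; hence the ValueError. Below is a minimal sketch of a capability guard that would let the suite skip cleanly instead of erroring; the helper name and the skip wiring are illustrative assumptions, not code from activation_test.py:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    # Hypothetical helper: Triton lowers torch.float8_e4m3fn to fp8e4nv,
    # which its NVIDIA backend only accepts on compute capability >= 8.9
    # (e.g. L4, L40S, H100). The A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Skip the fp8 activation tests on unsupported GPUs instead of erroring out.
@unittest.skipIf(not fp8e4nv_supported(), "Triton fp8e4nv requires SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):
    ...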
[The remaining Hypothesis examples failed identically; duplicate test bodies and tracebacks elided. Each run ended in triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the sampled parameters and the first kernel to hit the error differ:]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (via ref_fn)
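The pattern in the condensed list is mechanical: with compiled=False the fused _fbgemm_silu_mul_quant kernel compiles the fp8e4nv cast first and fails there, while with compiled=True the forward call gets through and the same cast is first compiled in the reference path's _kernel_quantize_fp8_row. In torch terms, Triton's fp8e4nv is torch.float8_e4m3fn and fp8e5 is torch.float8_e5m2, so a device-aware dtype choice is possible in principle. A rough sketch under that assumption; pick_fp8_dtype is a hypothetical name, not FBGEMM's actual API:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Hypothetical fallback, not FBGEMM's real logic: prefer e4m3
    # (Triton fp8e4nv, compute capability >= 8.9) and fall back to e5m2
    # (Triton fp8e5), which SM 8.x parts such as the A10G still accept,
    # at the cost of one fewer mantissa bit of precision.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2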
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2026829Z 2025-05-07T20:31:41.2027284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2027290Z 2025-05-07T20:31:41.2027400Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2027626Z self=, 2025-05-07T20:31:41.2027702Z T=4096, 2025-05-07T20:31:41.2027775Z D=5120, 2025-05-07T20:31:41.2027860Z scale_ub=1200.0, 2025-05-07T20:31:41.2027943Z contiguous=True, 2025-05-07T20:31:41.2028023Z compiled=False, 2025-05-07T20:31:41.2028101Z ) 2025-05-07T20:31:41.2028316Z self = 2025-05-07T20:31:41.2028496Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2028501Z 2025-05-07T20:31:41.2028589Z @given( 2025-05-07T20:31:41.2028710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2028813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2028926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2029039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2029159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2029233Z ) 2025-05-07T20:31:41.2029483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2029576Z def test_silu_mul_quant( 2025-05-07T20:31:41.2029652Z self, 2025-05-07T20:31:41.2029731Z T: int, 2025-05-07T20:31:41.2029805Z D: int, 2025-05-07T20:31:41.2029903Z scale_ub: Optional[float], 2025-05-07T20:31:41.2029999Z contiguous: bool, 2025-05-07T20:31:41.2030081Z compiled: bool, 2025-05-07T20:31:41.2030157Z ) -> None: 2025-05-07T20:31:41.2030337Z torch.manual_seed(2025) 2025-05-07T20:31:41.2030416Z 2025-05-07T20:31:41.2030586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2037256Z 2025-05-07T20:31:41.2037371Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2037504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2037603Z x = x_sign * x_clamp 2025-05-07T20:31:41.2037683Z x0 = x[:, :D] 2025-05-07T20:31:41.2037764Z x1 = x[:, D:] 2025-05-07T20:31:41.2037845Z 2025-05-07T20:31:41.2037931Z if contiguous: 2025-05-07T20:31:41.2038024Z x0 = x0.contiguous() 2025-05-07T20:31:41.2038118Z x1 = x1.contiguous() 2025-05-07T20:31:41.2038191Z 2025-05-07T20:31:41.2038288Z if scale_ub is not None: 2025-05-07T20:31:41.2038648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2038845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2038932Z ) 2025-05-07T20:31:41.2039022Z else: 2025-05-07T20:31:41.2039120Z scale_ub_tensor = None 2025-05-07T20:31:41.2039200Z 2025-05-07T20:31:41.2039337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2039426Z op = silu_mul_quant 2025-05-07T20:31:41.2039517Z if compiled: 2025-05-07T20:31:41.2039617Z op = torch.compile(op) 2025-05-07T20:31:41.2039719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2039795Z 2025-05-07T20:31:41.2039885Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2039891Z 2025-05-07T20:31:41.2039994Z moe/activation_test.py:117: 2025-05-07T20:31:41.2040124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2040227Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2040332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2040842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2041200Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2041573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2041794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2042142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2042234Z kernel = self.compile( 2025-05-07T20:31:41.2042619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2042797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2042923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2042928Z 2025-05-07T20:31:41.2043146Z self = 2025-05-07T20:31:41.2044067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2044567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090e720720>} 2025-05-07T20:31:41.2045322Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2045509Z context = 2025-05-07T20:31:41.2045514Z 2025-05-07T20:31:41.2045685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2046064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2046176Z module_map=module_map) 2025-05-07T20:31:41.2046343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2046439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2046523Z E ^ 2025-05-07T20:31:41.2046876Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2046881Z 
2025-05-07T20:31:41.2047293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:41.2047298Z 
2025-05-07T20:31:41.2047404Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:41.2047624Z     self=,
2025-05-07T20:31:41.2047701Z     T=1,
2025-05-07T20:31:41.2047783Z     D=5120,
2025-05-07T20:31:41.2047862Z     scale_ub=None,
2025-05-07T20:31:41.2047954Z     contiguous=True,
2025-05-07T20:31:41.2048040Z     compiled=True,
2025-05-07T20:31:41.2048114Z )
2025-05-07T20:31:41.2048338Z self = 
2025-05-07T20:31:41.2048497Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:41.2048502Z 
2025-05-07T20:31:41.2048574Z     @given(
2025-05-07T20:31:41.2048696Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:41.2048791Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:41.2048901Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:41.2049017Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:41.2049126Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:41.2049202Z     )
2025-05-07T20:31:41.2049444Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:41.2049535Z     def test_silu_mul_quant(
2025-05-07T20:31:41.2049614Z         self,
2025-05-07T20:31:41.2049774Z         T: int,
2025-05-07T20:31:41.2049847Z         D: int,
2025-05-07T20:31:41.2049946Z         scale_ub: Optional[float],
2025-05-07T20:31:41.2050033Z         contiguous: bool,
2025-05-07T20:31:41.2050116Z         compiled: bool,
2025-05-07T20:31:41.2050198Z     ) -> None:
2025-05-07T20:31:41.2050291Z         torch.manual_seed(2025)
2025-05-07T20:31:41.2050362Z 
2025-05-07T20:31:41.2050536Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:41.2050609Z 
2025-05-07T20:31:41.2050704Z         x_sign = torch.sign(x)
2025-05-07T20:31:41.2050831Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:41.2050914Z         x = x_sign * x_clamp
2025-05-07T20:31:41.2050997Z         x0 = x[:, :D]
2025-05-07T20:31:41.2051074Z         x1 = x[:, D:]
2025-05-07T20:31:41.2051145Z 
2025-05-07T20:31:41.2051229Z         if contiguous:
2025-05-07T20:31:41.2051318Z             x0 = x0.contiguous()
2025-05-07T20:31:41.2051409Z             x1 = x1.contiguous()
2025-05-07T20:31:41.2051492Z 
2025-05-07T20:31:41.2051578Z         if scale_ub is not None:
2025-05-07T20:31:41.2051682Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:41.2051820Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:41.2051894Z             )
2025-05-07T20:31:41.2051968Z         else:
2025-05-07T20:31:41.2052063Z             scale_ub_tensor = None
2025-05-07T20:31:41.2052135Z 
2025-05-07T20:31:41.2052273Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2052361Z             op = silu_mul_quant
2025-05-07T20:31:41.2052443Z             if compiled:
2025-05-07T20:31:41.2052548Z                 op = torch.compile(op)
2025-05-07T20:31:41.2052650Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:41.2052723Z 
2025-05-07T20:31:41.2052821Z         y_fp8, y_scale = fn()
2025-05-07T20:31:41.2052938Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:41.2053008Z 
2025-05-07T20:31:41.2053237Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2053339Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:41.2053435Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:41.2053560Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:41.2053696Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2053778Z 
2025-05-07T20:31:41.2053874Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:41.2053879Z 
2025-05-07T20:31:41.2053973Z moe/activation_test.py:126: 
2025-05-07T20:31:41.2054106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2054209Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:41.2054340Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2054910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:41.2055015Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:41.2055383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:41.2055604Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:41.2055972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:41.2056235Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2056632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:31:41.2056894Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2057274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:41.2057517Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:41.2057865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:41.2057941Z     fn()
2025-05-07T20:31:41.2058341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:41.2058429Z     self.fn.run(
2025-05-07T20:31:41.2058769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:41.2058858Z     kernel = self.compile(
2025-05-07T20:31:41.2059239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:41.2059418Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:41.2059543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2059553Z 
2025-05-07T20:31:41.2059760Z self = 
2025-05-07T20:31:41.2060537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:41.2061032Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924366660>}
2025-05-07T20:31:41.2061785Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:41.2061972Z context = 
2025-05-07T20:31:41.2061976Z 
2025-05-07T20:31:41.2062216Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:41.2062489Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:41.2062594Z                           module_map=module_map)
2025-05-07T20:31:41.2062756Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:41.2062856Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:41.2062931Z E       ^
2025-05-07T20:31:41.2063289Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2063294Z 
2025-05-07T20:31:41.2063705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:41.2063710Z 
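Every example in this run fails at the same spot, before the kernel ever executes: fp8e4nv is Triton's name for torch.float8_e4m3fn, and the Triton build in this environment only lowers that encoding on GPUs with CUDA compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an NVIDIA A10G, which reports capability (8, 6) and hence offers only fp8e4b15 and fp8e5, so ast_to_ttir rejects both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant at their first line. A minimal sketch of the kind of capability guard that keeps such tests off pre-SM89 hardware (the helper name and test body here are illustrative, not taken from FBGEMM):

import unittest

import torch


def supports_fp8_e4m3() -> bool:
    # fp8e4nv lowering needs SM 8.9+ (e.g. L4, L40S, H100); the A10G in
    # this job reports (8, 6), so the guard evaluates to False there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class Fp8GuardExample(unittest.TestCase):
    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
    def test_fp8_cast(self) -> None:
        x = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)
        self.assertEqual(x.to(torch.float8_e4m3fn).dtype, torch.float8_e4m3fn)

Note that an eager-mode cast to torch.float8_e4m3fn can still succeed on the A10G; it is specifically the Triton lowering that rejects the dtype, which is why the failure surfaces as a CompilationError at kernel-compile time rather than as a runtime numerical error.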
2025-05-07T20:31:41.2068931Z op = torch.compile(op) 2025-05-07T20:31:41.2069038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2069107Z 2025-05-07T20:31:41.2069198Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2069400Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2069470Z 2025-05-07T20:31:41.2069610Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2069707Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2069803Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2069929Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2070066Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2070136Z 2025-05-07T20:31:41.2070240Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2070245Z 2025-05-07T20:31:41.2070340Z moe/activation_test.py:126: 2025-05-07T20:31:41.2070470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2070570Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2070699Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2071267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2071370Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2071730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2071957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2072323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2072580Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2072976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2073228Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2073611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2073855Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2074199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2074272Z fn() 2025-05-07T20:31:41.2074671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2074754Z self.fn.run( 2025-05-07T20:31:41.2075092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2075182Z kernel = self.compile( 2025-05-07T20:31:41.2075567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2075737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2075878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2075883Z 2025-05-07T20:31:41.2076085Z self = 2025-05-07T20:31:41.2076858Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:41.2077354Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f092455a5c0>} 2025-05-07T20:31:41.2078100Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2078370Z context = 2025-05-07T20:31:41.2078385Z 2025-05-07T20:31:41.2078550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2078811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2078920Z module_map=module_map) 2025-05-07T20:31:41.2079079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2079182Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2079257Z E ^ 2025-05-07T20:31:41.2079606Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2079611Z 2025-05-07T20:31:41.2080028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2080033Z 2025-05-07T20:31:41.2080133Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2080364Z self=, 2025-05-07T20:31:41.2080443Z T=128, 2025-05-07T20:31:41.2080517Z D=5120, 2025-05-07T20:31:41.2080601Z scale_ub=None, 2025-05-07T20:31:41.2080683Z contiguous=True, 2025-05-07T20:31:41.2080757Z compiled=True, 2025-05-07T20:31:41.2080831Z ) 2025-05-07T20:31:41.2081047Z self = 2025-05-07T20:31:41.2081212Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2081217Z 2025-05-07T20:31:41.2081298Z @given( 2025-05-07T20:31:41.2081413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2081516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2081628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2081742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2081855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2081931Z ) 2025-05-07T20:31:41.2082256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2082347Z def test_silu_mul_quant( 2025-05-07T20:31:41.2082421Z self, 2025-05-07T20:31:41.2082494Z T: int, 2025-05-07T20:31:41.2082572Z D: int, 2025-05-07T20:31:41.2082665Z scale_ub: Optional[float], 2025-05-07T20:31:41.2082754Z contiguous: bool, 2025-05-07T20:31:41.2082839Z compiled: bool, 2025-05-07T20:31:41.2082912Z ) -> None: 2025-05-07T20:31:41.2083006Z torch.manual_seed(2025) 2025-05-07T20:31:41.2083076Z 2025-05-07T20:31:41.2083319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2083396Z 2025-05-07T20:31:41.2083484Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2083607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2083697Z x = x_sign * x_clamp 2025-05-07T20:31:41.2083772Z x0 = x[:, :D] 2025-05-07T20:31:41.2083857Z x1 = x[:, D:] 2025-05-07T20:31:41.2083934Z 2025-05-07T20:31:41.2084014Z if contiguous: 2025-05-07T20:31:41.2084101Z x0 = x0.contiguous() 2025-05-07T20:31:41.2084189Z x1 = x1.contiguous() 2025-05-07T20:31:41.2084259Z 2025-05-07T20:31:41.2084349Z if scale_ub is not None: 2025-05-07T20:31:41.2084450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2084582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2084663Z ) 2025-05-07T20:31:41.2084737Z else: 2025-05-07T20:31:41.2084829Z scale_ub_tensor = None 2025-05-07T20:31:41.2084900Z 2025-05-07T20:31:41.2085028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:41.2085115Z op = silu_mul_quant 2025-05-07T20:31:41.2085203Z if compiled: 2025-05-07T20:31:41.2085298Z op = torch.compile(op) 2025-05-07T20:31:41.2085484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2085561Z 2025-05-07T20:31:41.2085650Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2085773Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2085843Z 2025-05-07T20:31:41.2085973Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2086082Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2086177Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2086293Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2086431Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2086500Z 2025-05-07T20:31:41.2086594Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2086599Z 2025-05-07T20:31:41.2086700Z moe/activation_test.py:126: 2025-05-07T20:31:41.2086823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2086926Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2087064Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2087626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2087729Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2088088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2088308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2088679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2088932Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2089332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2089589Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2090043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2090207Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2090548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2090626Z fn() 2025-05-07T20:31:41.2091024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2091101Z self.fn.run( 2025-05-07T20:31:41.2091442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2091531Z kernel = self.compile( 2025-05-07T20:31:41.2091913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2092093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2092216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2092221Z 2025-05-07T20:31:41.2092427Z self = 2025-05-07T20:31:41.2093198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2093693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924558400>} 2025-05-07T20:31:41.2094545Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2094736Z context = 2025-05-07T20:31:41.2094741Z 2025-05-07T20:31:41.2094907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2095169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2095277Z module_map=module_map) 2025-05-07T20:31:41.2095433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2095530Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2095609Z E ^ 2025-05-07T20:31:41.2095961Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2095965Z 2025-05-07T20:31:41.2096374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2096383Z 2025-05-07T20:31:41.2096492Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2096711Z self=, 2025-05-07T20:31:41.2096790Z T=4096, 2025-05-07T20:31:41.2096864Z D=5120, 2025-05-07T20:31:41.2096941Z scale_ub=None, 2025-05-07T20:31:41.2097027Z contiguous=True, 2025-05-07T20:31:41.2097106Z compiled=True, 2025-05-07T20:31:41.2097177Z ) 2025-05-07T20:31:41.2097392Z self = 2025-05-07T20:31:41.2097564Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2097569Z 2025-05-07T20:31:41.2097644Z @given( 2025-05-07T20:31:41.2097763Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2097855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2097965Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2098085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2098275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2098346Z ) 2025-05-07T20:31:41.2098590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2098679Z def test_silu_mul_quant( 2025-05-07T20:31:41.2098755Z self, 2025-05-07T20:31:41.2098828Z T: int, 2025-05-07T20:31:41.2098903Z D: int, 2025-05-07T20:31:41.2098998Z scale_ub: Optional[float], 2025-05-07T20:31:41.2099082Z contiguous: bool, 2025-05-07T20:31:41.2099163Z compiled: bool, 2025-05-07T20:31:41.2099241Z ) -> None: 2025-05-07T20:31:41.2099333Z torch.manual_seed(2025) 2025-05-07T20:31:41.2099398Z 2025-05-07T20:31:41.2099565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2099635Z 2025-05-07T20:31:41.2099721Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2099846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2099943Z x = x_sign * x_clamp 2025-05-07T20:31:41.2100020Z x0 = x[:, :D] 2025-05-07T20:31:41.2100100Z x1 = x[:, D:] 2025-05-07T20:31:41.2100169Z 2025-05-07T20:31:41.2100251Z if contiguous: 2025-05-07T20:31:41.2100337Z x0 = x0.contiguous() 2025-05-07T20:31:41.2100420Z x1 = x1.contiguous() 2025-05-07T20:31:41.2100496Z 2025-05-07T20:31:41.2100584Z if scale_ub is not None: 2025-05-07T20:31:41.2100687Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2100820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2100891Z ) 2025-05-07T20:31:41.2100962Z else: 2025-05-07T20:31:41.2101056Z scale_ub_tensor 
= None 2025-05-07T20:31:41.2101126Z 2025-05-07T20:31:41.2101250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2101345Z op = silu_mul_quant 2025-05-07T20:31:41.2101426Z if compiled: 2025-05-07T20:31:41.2101603Z op = torch.compile(op) 2025-05-07T20:31:41.2101709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2101777Z 2025-05-07T20:31:41.2101866Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2101982Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2102052Z 2025-05-07T20:31:41.2102184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2102282Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2102377Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2102501Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2102639Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2102711Z 2025-05-07T20:31:41.2102813Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2102817Z 2025-05-07T20:31:41.2102911Z moe/activation_test.py:126: 2025-05-07T20:31:41.2103043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2103145Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2103275Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2103834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2103932Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2104290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2104512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2104877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2105135Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2105536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2105866Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2106243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2106406Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2106747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2106821Z fn() 2025-05-07T20:31:41.2107218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2107300Z self.fn.run( 2025-05-07T20:31:41.2107636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2107724Z kernel = self.compile( 2025-05-07T20:31:41.2108113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2108290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2108419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2108423Z 2025-05-07T20:31:41.2108623Z self = 2025-05-07T20:31:41.2109391Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2109888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090db2e7a0>} 2025-05-07T20:31:41.2110713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2110910Z context = 2025-05-07T20:31:41.2110914Z 2025-05-07T20:31:41.2111073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2111336Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2111438Z module_map=module_map) 2025-05-07T20:31:41.2111593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2111695Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2111769Z E ^ 2025-05-07T20:31:41.2112129Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2112134Z 2025-05-07T20:31:41.2112555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2112563Z 2025-05-07T20:31:41.2112660Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2112882Z self=, 2025-05-07T20:31:41.2112956Z T=16384, 2025-05-07T20:31:41.2113030Z D=5120, 2025-05-07T20:31:41.2113112Z scale_ub=None, 2025-05-07T20:31:41.2113194Z contiguous=True, 2025-05-07T20:31:41.2113273Z compiled=True, 2025-05-07T20:31:41.2113345Z ) 2025-05-07T20:31:41.2113558Z self = 2025-05-07T20:31:41.2113724Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2113732Z 2025-05-07T20:31:41.2113806Z @given( 2025-05-07T20:31:41.2113919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2114018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2114133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2114327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2114439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2114510Z ) 2025-05-07T20:31:41.2114752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2114845Z def test_silu_mul_quant( 2025-05-07T20:31:41.2114917Z self, 2025-05-07T20:31:41.2114992Z T: int, 2025-05-07T20:31:41.2115069Z D: int, 2025-05-07T20:31:41.2115160Z scale_ub: Optional[float], 2025-05-07T20:31:41.2115250Z contiguous: bool, 2025-05-07T20:31:41.2115331Z compiled: bool, 2025-05-07T20:31:41.2115403Z ) -> None: 2025-05-07T20:31:41.2115494Z torch.manual_seed(2025) 2025-05-07T20:31:41.2115561Z 2025-05-07T20:31:41.2115725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2115800Z 2025-05-07T20:31:41.2115894Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2116020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2116106Z x = x_sign * x_clamp 2025-05-07T20:31:41.2116184Z x0 = x[:, :D] 2025-05-07T20:31:41.2116257Z x1 = x[:, D:] 2025-05-07T20:31:41.2116330Z 2025-05-07T20:31:41.2116406Z if contiguous: 2025-05-07T20:31:41.2116495Z x0 = x0.contiguous() 2025-05-07T20:31:41.2116581Z x1 = x1.contiguous() 2025-05-07T20:31:41.2116652Z 2025-05-07T20:31:41.2116738Z if scale_ub is not None: 2025-05-07T20:31:41.2116837Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2116970Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:41.2117043Z ) 2025-05-07T20:31:41.2117113Z else: 2025-05-07T20:31:41.2117202Z scale_ub_tensor = None 2025-05-07T20:31:41.2117277Z 2025-05-07T20:31:41.2117401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2117579Z op = silu_mul_quant 2025-05-07T20:31:41.2117664Z if compiled: 2025-05-07T20:31:41.2117760Z op = torch.compile(op) 2025-05-07T20:31:41.2117867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2117936Z 2025-05-07T20:31:41.2118021Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2118143Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2118213Z 2025-05-07T20:31:41.2118343Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2118444Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2118538Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2118657Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2118797Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2118866Z 2025-05-07T20:31:41.2118963Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2118970Z 2025-05-07T20:31:41.2119067Z moe/activation_test.py:126: 2025-05-07T20:31:41.2119193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2119303Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2119432Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2119990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2120093Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2120453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2120675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2121038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2121293Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2121776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2122027Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2122402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2122566Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2122907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2122984Z fn() 2025-05-07T20:31:41.2123456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2123533Z self.fn.run( 2025-05-07T20:31:41.2123876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2123971Z kernel = self.compile( 2025-05-07T20:31:41.2124354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2124528Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2124653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2124658Z 2025-05-07T20:31:41.2124865Z self = 2025-05-07T20:31:41.2125634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2126239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d9de200>} 2025-05-07T20:31:41.2127012Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2127198Z context = 2025-05-07T20:31:41.2127203Z 2025-05-07T20:31:41.2127364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2127623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2127728Z module_map=module_map) 2025-05-07T20:31:41.2127885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2127983Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2128059Z E ^ 2025-05-07T20:31:41.2128415Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2128423Z 2025-05-07T20:31:41.2128834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2128843Z 2025-05-07T20:31:41.2128940Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2129157Z self=, 2025-05-07T20:31:41.2129237Z T=1, 2025-05-07T20:31:41.2129312Z D=5120, 2025-05-07T20:31:41.2129391Z scale_ub=1200.0, 2025-05-07T20:31:41.2129478Z contiguous=True, 2025-05-07T20:31:41.2129561Z compiled=True, 2025-05-07T20:31:41.2129629Z ) 2025-05-07T20:31:41.2129848Z self = 2025-05-07T20:31:41.2130008Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2130013Z 2025-05-07T20:31:41.2130096Z @given( 2025-05-07T20:31:41.2130214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2130390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2130505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2130616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2130726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2130798Z ) 2025-05-07T20:31:41.2131038Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2131125Z def test_silu_mul_quant( 2025-05-07T20:31:41.2131200Z self, 2025-05-07T20:31:41.2131273Z T: int, 2025-05-07T20:31:41.2131343Z D: int, 2025-05-07T20:31:41.2131441Z scale_ub: Optional[float], 2025-05-07T20:31:41.2131526Z contiguous: bool, 2025-05-07T20:31:41.2131609Z compiled: bool, 2025-05-07T20:31:41.2131683Z ) -> None: 2025-05-07T20:31:41.2131774Z torch.manual_seed(2025) 2025-05-07T20:31:41.2131845Z 2025-05-07T20:31:41.2132017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2132092Z 2025-05-07T20:31:41.2132183Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2132302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2132385Z x = x_sign * x_clamp 2025-05-07T20:31:41.2132465Z x0 = x[:, :D] 2025-05-07T20:31:41.2132541Z x1 = x[:, D:] 2025-05-07T20:31:41.2132611Z 2025-05-07T20:31:41.2132694Z if contiguous: 2025-05-07T20:31:41.2132778Z x0 = x0.contiguous() 2025-05-07T20:31:41.2132872Z x1 = x1.contiguous() 2025-05-07T20:31:41.2132938Z 2025-05-07T20:31:41.2133024Z if scale_ub is not None: 2025-05-07T20:31:41.2133129Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:41.2133258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2133331Z ) 2025-05-07T20:31:41.2133409Z else: 2025-05-07T20:31:41.2133499Z scale_ub_tensor = None 2025-05-07T20:31:41.2133650Z 2025-05-07T20:31:41.2133788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2133872Z op = silu_mul_quant 2025-05-07T20:31:41.2133952Z if compiled: 2025-05-07T20:31:41.2134049Z op = torch.compile(op) 2025-05-07T20:31:41.2134152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2134228Z 2025-05-07T20:31:41.2134316Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2134320Z 2025-05-07T20:31:41.2134415Z moe/activation_test.py:117: 2025-05-07T20:31:41.2134542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2134638Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2134732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2135101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2135187Z return fn(*args, **kwargs) 2025-05-07T20:31:41.2135690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2135792Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2136147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2136371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2136710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2136801Z kernel = self.compile( 2025-05-07T20:31:41.2137202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2137398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2137529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2137534Z 2025-05-07T20:31:41.2137825Z self = 2025-05-07T20:31:41.2138843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2139350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d451d00>} 2025-05-07T20:31:41.2140097Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2140292Z context = 2025-05-07T20:31:41.2140297Z 2025-05-07T20:31:41.2140465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2140729Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2140836Z module_map=module_map) 2025-05-07T20:31:41.2140993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2141092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2141164Z E ^ 2025-05-07T20:31:41.2141516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2141521Z 2025-05-07T20:31:41.2141934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2141938Z 2025-05-07T20:31:41.2142036Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2142260Z self=, 2025-05-07T20:31:41.2142335Z T=1, 2025-05-07T20:31:41.2142543Z D=5120, 2025-05-07T20:31:41.2142632Z scale_ub=None, 2025-05-07T20:31:41.2142714Z contiguous=False, 2025-05-07T20:31:41.2142794Z compiled=True, 2025-05-07T20:31:41.2142867Z ) 2025-05-07T20:31:41.2143083Z self = 2025-05-07T20:31:41.2143245Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2143249Z 2025-05-07T20:31:41.2143329Z @given( 2025-05-07T20:31:41.2143443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2143540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2143650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2143760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2143872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2143940Z ) 2025-05-07T20:31:41.2144183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2144284Z def test_silu_mul_quant( 2025-05-07T20:31:41.2144362Z self, 2025-05-07T20:31:41.2144434Z T: int, 2025-05-07T20:31:41.2144505Z D: int, 2025-05-07T20:31:41.2144600Z scale_ub: Optional[float], 2025-05-07T20:31:41.2144682Z contiguous: bool, 2025-05-07T20:31:41.2144767Z compiled: bool, 2025-05-07T20:31:41.2144840Z ) -> None: 2025-05-07T20:31:41.2144933Z torch.manual_seed(2025) 2025-05-07T20:31:41.2145002Z 2025-05-07T20:31:41.2145169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2145240Z 2025-05-07T20:31:41.2145328Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2145449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2145536Z x = x_sign * x_clamp 2025-05-07T20:31:41.2145610Z x0 = x[:, :D] 2025-05-07T20:31:41.2145683Z x1 = x[:, D:] 2025-05-07T20:31:41.2145759Z 2025-05-07T20:31:41.2145837Z if contiguous: 2025-05-07T20:31:41.2146049Z x0 = x0.contiguous() 2025-05-07T20:31:41.2146137Z x1 = x1.contiguous() 2025-05-07T20:31:41.2146206Z 2025-05-07T20:31:41.2146292Z if scale_ub is not None: 2025-05-07T20:31:41.2146398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2146528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2146603Z ) 2025-05-07T20:31:41.2146676Z else: 2025-05-07T20:31:41.2146769Z scale_ub_tensor = None 2025-05-07T20:31:41.2146840Z 2025-05-07T20:31:41.2146965Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2147051Z op = silu_mul_quant 2025-05-07T20:31:41.2147146Z if compiled: 2025-05-07T20:31:41.2147257Z op = torch.compile(op) 2025-05-07T20:31:41.2147374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2147454Z 2025-05-07T20:31:41.2147540Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2147659Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2147735Z 2025-05-07T20:31:41.2147865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2147963Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2148058Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2148178Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2148317Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2148383Z 2025-05-07T20:31:41.2148479Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:41.2148483Z 2025-05-07T20:31:41.2148582Z moe/activation_test.py:126: 2025-05-07T20:31:41.2148704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2148807Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2148937Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2149577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2149683Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2150045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2150266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2150639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2150891Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2151297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2151546Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2151922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2152093Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2152432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2152511Z fn() 2025-05-07T20:31:41.2152909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2152985Z self.fn.run( 2025-05-07T20:31:41.2153326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2153414Z kernel = self.compile( 2025-05-07T20:31:41.2153794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2153968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2154096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2154201Z 2025-05-07T20:31:41.2154410Z self = 2025-05-07T20:31:41.2155180Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2155671Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f090d451260>} 2025-05-07T20:31:41.2156423Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2156615Z context = 2025-05-07T20:31:41.2156625Z 2025-05-07T20:31:41.2156792Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2157052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2157152Z module_map=module_map) 2025-05-07T20:31:41.2157312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2157411Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2157488Z E ^ 2025-05-07T20:31:41.2157842Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2157847Z 2025-05-07T20:31:41.2158257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2158262Z 2025-05-07T20:31:41.2158362Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2158657Z self=, 2025-05-07T20:31:41.2158741Z T=1, 2025-05-07T20:31:41.2158813Z D=5120, 2025-05-07T20:31:41.2158891Z scale_ub=None, 2025-05-07T20:31:41.2158973Z contiguous=True, 2025-05-07T20:31:41.2159054Z compiled=False, 2025-05-07T20:31:41.2159122Z ) 2025-05-07T20:31:41.2159338Z self = 2025-05-07T20:31:41.2159497Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2159502Z 2025-05-07T20:31:41.2159576Z @given( 2025-05-07T20:31:41.2159694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2159789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2159903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2160012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2160122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2160201Z ) 2025-05-07T20:31:41.2165696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2165814Z def test_silu_mul_quant( 2025-05-07T20:31:41.2165893Z self, 2025-05-07T20:31:41.2165969Z T: int, 2025-05-07T20:31:41.2166040Z D: int, 2025-05-07T20:31:41.2166145Z scale_ub: Optional[float], 2025-05-07T20:31:41.2166233Z contiguous: bool, 2025-05-07T20:31:41.2166315Z compiled: bool, 2025-05-07T20:31:41.2166398Z ) -> None: 2025-05-07T20:31:41.2166489Z torch.manual_seed(2025) 2025-05-07T20:31:41.2166566Z 2025-05-07T20:31:41.2166738Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2166811Z 2025-05-07T20:31:41.2166902Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2167026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2167111Z x = x_sign * x_clamp 2025-05-07T20:31:41.2167193Z x0 = x[:, :D] 2025-05-07T20:31:41.2167271Z x1 = x[:, D:] 2025-05-07T20:31:41.2167450Z 2025-05-07T20:31:41.2167540Z if contiguous: 2025-05-07T20:31:41.2167631Z x0 = x0.contiguous() 2025-05-07T20:31:41.2167716Z x1 = x1.contiguous() 2025-05-07T20:31:41.2167792Z 2025-05-07T20:31:41.2167876Z if scale_ub is not None: 2025-05-07T20:31:41.2167980Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2168110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2168183Z ) 2025-05-07T20:31:41.2168263Z else: 2025-05-07T20:31:41.2168352Z scale_ub_tensor = None 2025-05-07T20:31:41.2168422Z 2025-05-07T20:31:41.2168553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2168640Z op = silu_mul_quant 2025-05-07T20:31:41.2168726Z if compiled: 2025-05-07T20:31:41.2168825Z 
op = torch.compile(op) 2025-05-07T20:31:41.2168927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2169009Z 2025-05-07T20:31:41.2169102Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2169107Z 2025-05-07T20:31:41.2169202Z moe/activation_test.py:117: 2025-05-07T20:31:41.2169338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2169438Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2169537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2170045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2170142Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2170502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2170729Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2171234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2171333Z kernel = self.compile( 2025-05-07T20:31:41.2171717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2171889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2172022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2172027Z 2025-05-07T20:31:41.2172227Z self = 2025-05-07T20:31:41.2173001Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2173502Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decff60>} 2025-05-07T20:31:41.2174252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2174442Z context = 2025-05-07T20:31:41.2174447Z 2025-05-07T20:31:41.2174609Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2174873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2174977Z module_map=module_map) 2025-05-07T20:31:41.2175137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2175236Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2175309Z E ^ 2025-05-07T20:31:41.2175668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2175755Z 2025-05-07T20:31:41.2176171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2176176Z 2025-05-07T20:31:41.2176275Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2176501Z self=, 2025-05-07T20:31:41.2176576Z T=128, 2025-05-07T20:31:41.2176645Z D=5120, 2025-05-07T20:31:41.2176725Z scale_ub=None, 2025-05-07T20:31:41.2176807Z contiguous=False, 2025-05-07T20:31:41.2176891Z compiled=True, 2025-05-07T20:31:41.2176961Z ) 2025-05-07T20:31:41.2177176Z self = 2025-05-07T20:31:41.2177348Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2177353Z 2025-05-07T20:31:41.2177427Z @given( 2025-05-07T20:31:41.2177543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2177643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2177762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2177874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2177987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2178060Z ) 2025-05-07T20:31:41.2178304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2178392Z def test_silu_mul_quant( 2025-05-07T20:31:41.2178467Z self, 2025-05-07T20:31:41.2178547Z T: int, 2025-05-07T20:31:41.2178622Z D: int, 2025-05-07T20:31:41.2178714Z scale_ub: Optional[float], 2025-05-07T20:31:41.2178805Z contiguous: bool, 2025-05-07T20:31:41.2178886Z compiled: bool, 2025-05-07T20:31:41.2178957Z ) -> None: 2025-05-07T20:31:41.2179050Z torch.manual_seed(2025) 2025-05-07T20:31:41.2179121Z 2025-05-07T20:31:41.2179367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2179448Z 2025-05-07T20:31:41.2179535Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2179659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2179744Z x = x_sign * x_clamp 2025-05-07T20:31:41.2179819Z x0 = x[:, :D] 2025-05-07T20:31:41.2179898Z x1 = x[:, D:] 2025-05-07T20:31:41.2179969Z 2025-05-07T20:31:41.2180049Z if contiguous: 2025-05-07T20:31:41.2180143Z x0 = x0.contiguous() 2025-05-07T20:31:41.2180229Z x1 = x1.contiguous() 2025-05-07T20:31:41.2180299Z 2025-05-07T20:31:41.2180395Z if scale_ub is not None: 2025-05-07T20:31:41.2180496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2180626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2180700Z ) 2025-05-07T20:31:41.2180774Z else: 2025-05-07T20:31:41.2180864Z scale_ub_tensor = None 2025-05-07T20:31:41.2180935Z 2025-05-07T20:31:41.2181071Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2181157Z op = silu_mul_quant 2025-05-07T20:31:41.2181240Z if compiled: 2025-05-07T20:31:41.2181336Z op = torch.compile(op) 2025-05-07T20:31:41.2181445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2181513Z 2025-05-07T20:31:41.2181601Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2181605Z 2025-05-07T20:31:41.2181701Z moe/activation_test.py:117: 2025-05-07T20:31:41.2181824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2181918Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2182016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2182382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2182474Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2182973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2183147Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2183509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2183731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2184079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2184171Z kernel = self.compile( 2025-05-07T20:31:41.2184553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2184728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2184854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2184858Z 2025-05-07T20:31:41.2185064Z self = 2025-05-07T20:31:41.2185845Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2186341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decea20>} 2025-05-07T20:31:41.2187098Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2187284Z context = 2025-05-07T20:31:41.2187288Z 2025-05-07T20:31:41.2187454Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2187792Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2187901Z module_map=module_map) 2025-05-07T20:31:41.2188062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2188156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2188232Z E ^ 2025-05-07T20:31:41.2188583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d4525c0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
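Note: the failure is architectural, not an input-dependent bug. Triton's fp8e4nv type is FP8 E4M3, which the NVIDIA backend only accepts on compute capability 8.9 and newer (Ada/Hopper); this job's g5.4xlarge runner carries an A10G (SM 8.6), where only the fp8e4b15 and fp8e5 encodings are available, hence the ValueError above. A minimal guard, sketched under the assumption of a unittest-style suite (gpu_supports_fp8e4nv is an illustrative helper, not an FBGEMM API), would skip rather than fail these tests on older GPUs:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv" in Triton) needs an NVIDIA GPU with
        # compute capability >= 8.9; the A10G on this runner is 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...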
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples failed at the same point, y_fp8, y_scale = fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] (the compiled=True cases additionally routed through torch/_dynamo/eval_frame.py:678 in _fn), raising the identical CompilationError shown above.
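Note: the failure reproduces outside FBGEMM. The kernel below is an illustrative sketch, not the _fbgemm_silu_mul_quant source; on a pre-SM-8.9 GPU, compiling any Triton kernel that casts to tl.float8e4nv should raise the same "type fp8e4nv not supported in this architecture" error at launch time:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM 8.6 this cast is what the compiler rejects.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on an A10G: triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)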
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090dc9eb60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
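Note: this example failed in the reference path rather than in fn(): triton_quantize_fp8_row is itself backed by a Triton kernel (_kernel_quantize_fp8_row), so on this GPU even the "reference" side of the comparison needs fp8e4nv support. A Triton-free stand-in for the row-wise quantization, sketched under the assumption that E4M3 with per-row inverse scales is the intended format (quantize_fp8_row_ref is illustrative, not the fbgemm_gpu API; the real kernel's clamping and epsilon details may differ), looks like:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its max |value| maps to the FP8 E4M3 max (448.0),
        # optionally clamping the row max to scale_ub first.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = x.abs().amax(dim=-1).to(torch.float32).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = fp8_max / row_max
        xq = (x.to(torch.float32) * scale[:, None]).clamp(-fp8_max, fp8_max)
        # Return the inverse scale, so dequantization matches the test's
        # y_fp8.to(torch.float32) * y_scale[:, None].
        return xq.to(torch.float8_e4m3fn), 1.0 / scale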
2025-05-07T20:31:41.2273182Z op = torch.compile(op) 2025-05-07T20:31:41.2273368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2273442Z 2025-05-07T20:31:41.2273529Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2273533Z 2025-05-07T20:31:41.2273628Z moe/activation_test.py:117: 2025-05-07T20:31:41.2273751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2273848Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2273940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2274307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2274401Z return fn(*args, **kwargs) 2025-05-07T20:31:41.2274893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2274985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2275353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2275577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2275918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2276008Z kernel = self.compile( 2025-05-07T20:31:41.2276388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2276561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2276683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2276688Z 2025-05-07T20:31:41.2276888Z self = 2025-05-07T20:31:41.2277670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2278272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090dc9f880>} 2025-05-07T20:31:41.2279034Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2279220Z context = 2025-05-07T20:31:41.2279225Z 2025-05-07T20:31:41.2279389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2279650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2279751Z module_map=module_map) 2025-05-07T20:31:41.2279911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2280020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2280092Z E ^ 2025-05-07T20:31:41.2280449Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2280454Z 2025-05-07T20:31:41.2280867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2280871Z 2025-05-07T20:31:41.2280971Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2281192Z self=, 2025-05-07T20:31:41.2281265Z T=1, 2025-05-07T20:31:41.2281337Z D=5120, 2025-05-07T20:31:41.2281417Z scale_ub=1200.0, 2025-05-07T20:31:41.2281499Z contiguous=False, 2025-05-07T20:31:41.2281578Z compiled=False, 2025-05-07T20:31:41.2281648Z ) 2025-05-07T20:31:41.2281865Z self = 2025-05-07T20:31:41.2282106Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2282115Z 2025-05-07T20:31:41.2282190Z @given( 2025-05-07T20:31:41.2282308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2282401Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2282510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2282624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2287900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2287990Z ) 2025-05-07T20:31:41.2288244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2288338Z def test_silu_mul_quant( 2025-05-07T20:31:41.2288415Z self, 2025-05-07T20:31:41.2288487Z T: int, 2025-05-07T20:31:41.2288562Z D: int, 2025-05-07T20:31:41.2288660Z scale_ub: Optional[float], 2025-05-07T20:31:41.2288747Z contiguous: bool, 2025-05-07T20:31:41.2288835Z compiled: bool, 2025-05-07T20:31:41.2288920Z ) -> None: 2025-05-07T20:31:41.2289012Z torch.manual_seed(2025) 2025-05-07T20:31:41.2289082Z 2025-05-07T20:31:41.2289253Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2289327Z 2025-05-07T20:31:41.2289414Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2289543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2289628Z x = x_sign * x_clamp 2025-05-07T20:31:41.2289710Z x0 = x[:, :D] 2025-05-07T20:31:41.2289785Z x1 = x[:, D:] 2025-05-07T20:31:41.2289856Z 2025-05-07T20:31:41.2289939Z if contiguous: 2025-05-07T20:31:41.2290028Z x0 = x0.contiguous() 2025-05-07T20:31:41.2290113Z x1 = x1.contiguous() 2025-05-07T20:31:41.2290187Z 2025-05-07T20:31:41.2290272Z if scale_ub is not None: 2025-05-07T20:31:41.2290374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2290511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2290693Z ) 2025-05-07T20:31:41.2290767Z else: 2025-05-07T20:31:41.2290862Z scale_ub_tensor = None 2025-05-07T20:31:41.2290933Z 2025-05-07T20:31:41.2291066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2291152Z op = silu_mul_quant 2025-05-07T20:31:41.2291232Z if compiled: 2025-05-07T20:31:41.2291334Z op = torch.compile(op) 2025-05-07T20:31:41.2291438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2291509Z 2025-05-07T20:31:41.2291597Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2291602Z 2025-05-07T20:31:41.2291699Z moe/activation_test.py:117: 2025-05-07T20:31:41.2291826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2291926Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2292025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2292538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2292640Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2293004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2293227Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2293568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2293659Z kernel = self.compile( 2025-05-07T20:31:41.2294052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2294223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2294352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2294441Z 2025-05-07T20:31:41.2294648Z self = 2025-05-07T20:31:41.2295416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2295915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce76480>} 2025-05-07T20:31:41.2296662Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2296852Z context = 2025-05-07T20:31:41.2296857Z 2025-05-07T20:31:41.2297023Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2297290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2297391Z module_map=module_map) 2025-05-07T20:31:41.2297548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2297648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2297720Z E ^ 2025-05-07T20:31:41.2298072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2298077Z 2025-05-07T20:31:41.2298494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2298498Z 2025-05-07T20:31:41.2298596Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2298818Z self=, 2025-05-07T20:31:41.2298892Z T=16384, 2025-05-07T20:31:41.2299054Z D=5120, 2025-05-07T20:31:41.2299135Z scale_ub=1200.0, 2025-05-07T20:31:41.2299215Z contiguous=False, 2025-05-07T20:31:41.2299291Z compiled=True, 2025-05-07T20:31:41.2299364Z ) 2025-05-07T20:31:41.2299577Z self = 2025-05-07T20:31:41.2299752Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2299759Z 2025-05-07T20:31:41.2299832Z @given( 2025-05-07T20:31:41.2299949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2300045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2300154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2300267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2300381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2300448Z ) 2025-05-07T20:31:41.2300694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2300791Z def test_silu_mul_quant( 2025-05-07T20:31:41.2300864Z self, 2025-05-07T20:31:41.2300934Z T: int, 2025-05-07T20:31:41.2301010Z D: int, 2025-05-07T20:31:41.2301103Z scale_ub: Optional[float], 2025-05-07T20:31:41.2301189Z contiguous: bool, 2025-05-07T20:31:41.2301268Z compiled: bool, 2025-05-07T20:31:41.2301340Z ) -> None: 2025-05-07T20:31:41.2301435Z torch.manual_seed(2025) 2025-05-07T20:31:41.2301503Z 2025-05-07T20:31:41.2301668Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2301741Z 2025-05-07T20:31:41.2301827Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2301946Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2302034Z x = x_sign * x_clamp 2025-05-07T20:31:41.2302111Z x0 = x[:, :D] 2025-05-07T20:31:41.2302188Z x1 = x[:, D:] 2025-05-07T20:31:41.2302265Z 2025-05-07T20:31:41.2302491Z if contiguous: 2025-05-07T20:31:41.2302591Z x0 = x0.contiguous() 2025-05-07T20:31:41.2302676Z x1 = x1.contiguous() 2025-05-07T20:31:41.2302745Z 2025-05-07T20:31:41.2302834Z if scale_ub is not None: 2025-05-07T20:31:41.2302935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2303067Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2303141Z ) 2025-05-07T20:31:41.2303215Z else: 2025-05-07T20:31:41.2303303Z scale_ub_tensor = None 2025-05-07T20:31:41.2303373Z 2025-05-07T20:31:41.2303500Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2303583Z op = silu_mul_quant 2025-05-07T20:31:41.2303669Z if compiled: 2025-05-07T20:31:41.2303763Z op = torch.compile(op) 2025-05-07T20:31:41.2303866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2303938Z 2025-05-07T20:31:41.2304022Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2304038Z 2025-05-07T20:31:41.2304133Z moe/activation_test.py:117: 2025-05-07T20:31:41.2304256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2304352Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2304451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2304818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2304905Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2305401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2305494Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2305857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2306078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2306502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2306592Z kernel = self.compile( 2025-05-07T20:31:41.2306973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2307151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2307273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2307277Z 2025-05-07T20:31:41.2307483Z self = 2025-05-07T20:31:41.2308254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2308754Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce751c0>} 2025-05-07T20:31:41.2309515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2309699Z context = 2025-05-07T20:31:41.2309704Z 2025-05-07T20:31:41.2309873Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2310134Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2310238Z module_map=module_map) 2025-05-07T20:31:41.2310398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2310490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2310563Z E ^ 2025-05-07T20:31:41.2311000Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2311010Z 2025-05-07T20:31:41.2311424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2311429Z 2025-05-07T20:31:41.2311529Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2311747Z self=, 2025-05-07T20:31:41.2311820Z T=2048, 2025-05-07T20:31:41.2311896Z D=7168, 2025-05-07T20:31:41.2311971Z scale_ub=1200.0, 2025-05-07T20:31:41.2312056Z contiguous=False, 2025-05-07T20:31:41.2312132Z compiled=True, 2025-05-07T20:31:41.2312200Z ) 2025-05-07T20:31:41.2312417Z self = 2025-05-07T20:31:41.2312590Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2312594Z 2025-05-07T20:31:41.2312672Z @given( 2025-05-07T20:31:41.2312791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2312885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2312994Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2313109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2313218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2313292Z ) 2025-05-07T20:31:41.2313537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2313623Z def test_silu_mul_quant( 2025-05-07T20:31:41.2313697Z self, 2025-05-07T20:31:41.2313767Z T: int, 2025-05-07T20:31:41.2313836Z D: int, 2025-05-07T20:31:41.2313930Z scale_ub: Optional[float], 2025-05-07T20:31:41.2314014Z contiguous: bool, 2025-05-07T20:31:41.2314093Z compiled: bool, 2025-05-07T20:31:41.2314167Z ) -> None: 2025-05-07T20:31:41.2314260Z torch.manual_seed(2025) 2025-05-07T20:31:41.2314436Z 2025-05-07T20:31:41.2314604Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2314675Z 2025-05-07T20:31:41.2314766Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2314885Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2314967Z x = x_sign * x_clamp 2025-05-07T20:31:41.2315043Z x0 = x[:, :D] 2025-05-07T20:31:41.2315117Z x1 = x[:, D:] 2025-05-07T20:31:41.2315184Z 2025-05-07T20:31:41.2315265Z if contiguous: 2025-05-07T20:31:41.2315351Z x0 = x0.contiguous() 2025-05-07T20:31:41.2315434Z x1 = x1.contiguous() 2025-05-07T20:31:41.2315505Z 2025-05-07T20:31:41.2315591Z if scale_ub is not None: 2025-05-07T20:31:41.2315694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2315826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2315894Z ) 2025-05-07T20:31:41.2315971Z else: 2025-05-07T20:31:41.2316065Z scale_ub_tensor = None 2025-05-07T20:31:41.2316135Z 2025-05-07T20:31:41.2316263Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2316348Z op = silu_mul_quant 2025-05-07T20:31:41.2316429Z if compiled: 2025-05-07T20:31:41.2316525Z op = torch.compile(op) 2025-05-07T20:31:41.2316624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2316694Z 2025-05-07T20:31:41.2316783Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2316787Z 2025-05-07T20:31:41.2316879Z moe/activation_test.py:117: 2025-05-07T20:31:41.2317002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2317098Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2317192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2317561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2317734Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f090ce76fc0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
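The failure is environmental rather than numerical: fp8e4nv is Triton's name for the e4m3 float8 format, which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job's g5.4xlarge runner carries an A10G, which reports capability (8, 6), so compilation of _fbgemm_silu_mul_quant aborts before any kernel launches. A guard of roughly the following shape would let the suite skip cleanly on such runners; the helper and class names below are illustrative, not taken from the FBGEMM test suite:

```python
# Illustrative guard only -- the helper and class names are hypothetical,
# not FBGEMM's actual code.
import unittest

import torch


def _cuda_supports_fp8e4nv() -> bool:
    """fp8e4nv (float8 e4m3) needs an SM 8.9+ GPU (Ada/Hopper);
    the A10G on a g5 runner reports (8, 6) and hits this error."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not _cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):
    ...
```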
Hypothesis then tried eleven more examples, and every one failed at the same point: the Triton frontend rejects the fp8e4nv conversion while compiling _fbgemm_silu_mul_quant, before the kernel ever runs (when compiled=True the traceback only gains one extra torch/_dynamo/eval_frame.py frame). The examples tried were:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)

Each attempt ended with the identical error:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2480748Z 2025-05-07T20:31:41.2481239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2481249Z 2025-05-07T20:31:41.2481351Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2481570Z self=, 2025-05-07T20:31:41.2481644Z T=16384, 2025-05-07T20:31:41.2481720Z D=5120, 2025-05-07T20:31:41.2481802Z scale_ub=1200.0, 2025-05-07T20:31:41.2481884Z contiguous=False, 2025-05-07T20:31:41.2481966Z compiled=False, 2025-05-07T20:31:41.2482036Z ) 2025-05-07T20:31:41.2482254Z self = 2025-05-07T20:31:41.2482432Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2482436Z 2025-05-07T20:31:41.2482508Z @given( 2025-05-07T20:31:41.2482625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2482721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2482836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2482955Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2483063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2483135Z ) 2025-05-07T20:31:41.2483484Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2483574Z def test_silu_mul_quant( 2025-05-07T20:31:41.2483650Z self, 2025-05-07T20:31:41.2483722Z T: int, 2025-05-07T20:31:41.2483794Z D: int, 2025-05-07T20:31:41.2483893Z scale_ub: Optional[float], 2025-05-07T20:31:41.2483977Z contiguous: bool, 2025-05-07T20:31:41.2484057Z compiled: bool, 2025-05-07T20:31:41.2484135Z ) -> None: 2025-05-07T20:31:41.2484225Z torch.manual_seed(2025) 2025-05-07T20:31:41.2484294Z 2025-05-07T20:31:41.2484462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2484533Z 2025-05-07T20:31:41.2484620Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2484834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2484917Z x = x_sign * x_clamp 2025-05-07T20:31:41.2484993Z x0 = x[:, :D] 2025-05-07T20:31:41.2485068Z x1 = x[:, D:] 2025-05-07T20:31:41.2485136Z 2025-05-07T20:31:41.2485217Z if contiguous: 2025-05-07T20:31:41.2485304Z x0 = x0.contiguous() 2025-05-07T20:31:41.2485389Z x1 = x1.contiguous() 2025-05-07T20:31:41.2485463Z 2025-05-07T20:31:41.2485550Z if scale_ub is not None: 2025-05-07T20:31:41.2485652Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2485786Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2485857Z ) 2025-05-07T20:31:41.2485929Z else: 2025-05-07T20:31:41.2486020Z scale_ub_tensor = None 2025-05-07T20:31:41.2486089Z 2025-05-07T20:31:41.2486219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2486309Z op = silu_mul_quant 2025-05-07T20:31:41.2486393Z if compiled: 2025-05-07T20:31:41.2486494Z op = torch.compile(op) 2025-05-07T20:31:41.2486594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2486665Z 2025-05-07T20:31:41.2486757Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2486761Z 2025-05-07T20:31:41.2486853Z moe/activation_test.py:117: 2025-05-07T20:31:41.2486976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2487076Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2487172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2487678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:41.2487769Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2488129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2488440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2488789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2488880Z kernel = self.compile( 2025-05-07T20:31:41.2489267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2489438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2489565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2489569Z 2025-05-07T20:31:41.2489772Z self = 2025-05-07T20:31:41.2490548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2491055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa29620>} 2025-05-07T20:31:41.2491808Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2491998Z context = 2025-05-07T20:31:41.2492002Z 2025-05-07T20:31:41.2492163Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2492426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2492530Z module_map=module_map) 2025-05-07T20:31:41.2492687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2492787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2492938Z E ^ 2025-05-07T20:31:41.2493290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2493295Z 2025-05-07T20:31:41.2493711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2493715Z 2025-05-07T20:31:41.2493815Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2494037Z self=, 2025-05-07T20:31:41.2494111Z T=16384, 2025-05-07T20:31:41.2494184Z D=5120, 2025-05-07T20:31:41.2494266Z scale_ub=1200.0, 2025-05-07T20:31:41.2494347Z contiguous=True, 2025-05-07T20:31:41.2494426Z compiled=True, 2025-05-07T20:31:41.2494496Z ) 2025-05-07T20:31:41.2494710Z self = 2025-05-07T20:31:41.2494883Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2494895Z 2025-05-07T20:31:41.2494969Z @given( 2025-05-07T20:31:41.2495084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2495182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2495293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2495407Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2495520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2495589Z ) 2025-05-07T20:31:41.2495832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2495922Z def test_silu_mul_quant( 2025-05-07T20:31:41.2495996Z self, 2025-05-07T20:31:41.2496069Z T: int, 2025-05-07T20:31:41.2496147Z D: int, 2025-05-07T20:31:41.2496239Z scale_ub: Optional[float], 2025-05-07T20:31:41.2496326Z contiguous: bool, 2025-05-07T20:31:41.2496510Z compiled: bool, 2025-05-07T20:31:41.2496587Z ) -> None: 2025-05-07T20:31:41.2496681Z torch.manual_seed(2025) 2025-05-07T20:31:41.2496751Z 2025-05-07T20:31:41.2496918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2496993Z 2025-05-07T20:31:41.2497082Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2497203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2497288Z x = x_sign * x_clamp 2025-05-07T20:31:41.2497364Z x0 = x[:, :D] 2025-05-07T20:31:41.2497438Z x1 = x[:, D:] 2025-05-07T20:31:41.2497511Z 2025-05-07T20:31:41.2497590Z if contiguous: 2025-05-07T20:31:41.2497676Z x0 = x0.contiguous() 2025-05-07T20:31:41.2497763Z x1 = x1.contiguous() 2025-05-07T20:31:41.2497830Z 2025-05-07T20:31:41.2497917Z if scale_ub is not None: 2025-05-07T20:31:41.2498018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2498155Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2498233Z ) 2025-05-07T20:31:41.2498306Z else: 2025-05-07T20:31:41.2498395Z scale_ub_tensor = None 2025-05-07T20:31:41.2498467Z 2025-05-07T20:31:41.2498593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2498676Z op = silu_mul_quant 2025-05-07T20:31:41.2498764Z if compiled: 2025-05-07T20:31:41.2498860Z op = torch.compile(op) 2025-05-07T20:31:41.2498962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2499035Z 2025-05-07T20:31:41.2499122Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2499126Z 2025-05-07T20:31:41.2499224Z moe/activation_test.py:117: 2025-05-07T20:31:41.2499346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2499442Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2499540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2499912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2500083Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2500581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2500676Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2501038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2501258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2501597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2501687Z kernel = self.compile( 2025-05-07T20:31:41.2502070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2502246Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2502377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2502381Z 2025-05-07T20:31:41.2502583Z self = 2025-05-07T20:31:41.2503356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2503852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2aa20>} 2025-05-07T20:31:41.2504604Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2504872Z context = 2025-05-07T20:31:41.2504877Z 2025-05-07T20:31:41.2505039Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2505302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2505404Z module_map=module_map) 2025-05-07T20:31:41.2505565Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2505658Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2505730Z E ^ 2025-05-07T20:31:41.2506085Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2506090Z 2025-05-07T20:31:41.2506501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2506505Z 2025-05-07T20:31:41.2506611Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2506835Z self=, 2025-05-07T20:31:41.2506908Z T=16384, 2025-05-07T20:31:41.2506985Z D=5120, 2025-05-07T20:31:41.2507060Z scale_ub=None, 2025-05-07T20:31:41.2507140Z contiguous=False, 2025-05-07T20:31:41.2507230Z compiled=True, 2025-05-07T20:31:41.2507314Z ) 2025-05-07T20:31:41.2507552Z self = 2025-05-07T20:31:41.2507726Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2507731Z 2025-05-07T20:31:41.2507804Z @given( 2025-05-07T20:31:41.2507919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2508016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2508130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2508246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2508363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2508511Z ) 2025-05-07T20:31:41.2508758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2508847Z def test_silu_mul_quant( 2025-05-07T20:31:41.2508920Z self, 2025-05-07T20:31:41.2508998Z T: int, 2025-05-07T20:31:41.2509071Z D: int, 2025-05-07T20:31:41.2509163Z scale_ub: Optional[float], 2025-05-07T20:31:41.2509249Z contiguous: bool, 2025-05-07T20:31:41.2509329Z compiled: bool, 2025-05-07T20:31:41.2509403Z ) -> None: 2025-05-07T20:31:41.2509494Z torch.manual_seed(2025) 2025-05-07T20:31:41.2509565Z 2025-05-07T20:31:41.2509736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2509808Z 2025-05-07T20:31:41.2509896Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2510020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2510109Z x = x_sign * x_clamp 2025-05-07T20:31:41.2510196Z x0 = x[:, :D] 2025-05-07T20:31:41.2510271Z x1 = x[:, D:] 2025-05-07T20:31:41.2510339Z 2025-05-07T20:31:41.2510416Z if contiguous: 2025-05-07T20:31:41.2510505Z x0 = x0.contiguous() 2025-05-07T20:31:41.2510590Z x1 = x1.contiguous() 2025-05-07T20:31:41.2510659Z 2025-05-07T20:31:41.2510746Z if scale_ub is not None: 2025-05-07T20:31:41.2510847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2510980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2511054Z ) 2025-05-07T20:31:41.2511127Z else: 2025-05-07T20:31:41.2511219Z scale_ub_tensor = None 2025-05-07T20:31:41.2511285Z 2025-05-07T20:31:41.2511410Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2511497Z op = silu_mul_quant 2025-05-07T20:31:41.2511576Z if compiled: 2025-05-07T20:31:41.2511751Z op = torch.compile(op) 2025-05-07T20:31:41.2511870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2511940Z 2025-05-07T20:31:41.2512027Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2512034Z 2025-05-07T20:31:41.2512129Z moe/activation_test.py:117: 2025-05-07T20:31:41.2512253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2512351Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2512446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2512813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2512903Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2513397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2513490Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2513856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2514081Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2514422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2514511Z kernel = self.compile( 2025-05-07T20:31:41.2514894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2515068Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2515190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2515195Z 2025-05-07T20:31:41.2515400Z self = 2025-05-07T20:31:41.2516175Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2516756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2bc40>} 2025-05-07T20:31:41.2517555Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2517744Z context = 2025-05-07T20:31:41.2517748Z 2025-05-07T20:31:41.2517913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2518173Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2518276Z module_map=module_map) 2025-05-07T20:31:41.2518443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2518541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2518617Z E ^ 2025-05-07T20:31:41.2518971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2518975Z 2025-05-07T20:31:41.2519386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2519391Z 2025-05-07T20:31:41.2519492Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2519711Z self=, 2025-05-07T20:31:41.2519790Z T=2048, 2025-05-07T20:31:41.2519862Z D=5120, 2025-05-07T20:31:41.2519937Z scale_ub=None, 2025-05-07T20:31:41.2520023Z contiguous=False, 2025-05-07T20:31:41.2520102Z compiled=True, 2025-05-07T20:31:41.2520171Z ) 2025-05-07T20:31:41.2520465Z self = 2025-05-07T20:31:41.2520640Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2520644Z 2025-05-07T20:31:41.2520715Z @given( 2025-05-07T20:31:41.2520833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2520929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2521041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2521153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2521260Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2521332Z ) 2025-05-07T20:31:41.2521572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2521662Z def test_silu_mul_quant( 2025-05-07T20:31:41.2521734Z self, 2025-05-07T20:31:41.2521807Z T: int, 2025-05-07T20:31:41.2521877Z D: int, 2025-05-07T20:31:41.2521973Z scale_ub: Optional[float], 2025-05-07T20:31:41.2522063Z contiguous: bool, 2025-05-07T20:31:41.2522147Z compiled: bool, 2025-05-07T20:31:41.2522224Z ) -> None: 2025-05-07T20:31:41.2522316Z torch.manual_seed(2025) 2025-05-07T20:31:41.2522385Z 2025-05-07T20:31:41.2522550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2522621Z 2025-05-07T20:31:41.2522710Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2522830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2522914Z x = x_sign * x_clamp 2025-05-07T20:31:41.2522992Z x0 = x[:, :D] 2025-05-07T20:31:41.2523069Z x1 = x[:, D:] 2025-05-07T20:31:41.2523137Z 2025-05-07T20:31:41.2523219Z if contiguous: 2025-05-07T20:31:41.2523387Z x0 = x0.contiguous() 2025-05-07T20:31:41.2523473Z x1 = x1.contiguous() 2025-05-07T20:31:41.2523545Z 2025-05-07T20:31:41.2523632Z if scale_ub is not None: 2025-05-07T20:31:41.2523740Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2523980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2524054Z ) 2025-05-07T20:31:41.2524128Z else: 2025-05-07T20:31:41.2529219Z scale_ub_tensor = None 2025-05-07T20:31:41.2529298Z 2025-05-07T20:31:41.2529443Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2529534Z op = silu_mul_quant 2025-05-07T20:31:41.2529616Z if compiled: 2025-05-07T20:31:41.2529717Z op = torch.compile(op) 2025-05-07T20:31:41.2529821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2529891Z 2025-05-07T20:31:41.2529983Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2529988Z 2025-05-07T20:31:41.2530081Z moe/activation_test.py:117: 2025-05-07T20:31:41.2530214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2530310Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2530413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2530790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2530880Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2531371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2531468Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2531823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2532046Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2532384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2532476Z kernel = self.compile( 2025-05-07T20:31:41.2532959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2533138Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2533262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2533273Z 2025-05-07T20:31:41.2533476Z self = 2025-05-07T20:31:41.2534243Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2534743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb87c0>} 2025-05-07T20:31:41.2535496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2535690Z context = 2025-05-07T20:31:41.2535695Z 2025-05-07T20:31:41.2535857Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2536118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2536229Z module_map=module_map) 2025-05-07T20:31:41.2536388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2536486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2536557Z E ^ 2025-05-07T20:31:41.2536911Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2536916Z 2025-05-07T20:31:41.2537332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2537422Z 2025-05-07T20:31:41.2537525Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2537743Z self=, 2025-05-07T20:31:41.2537824Z T=2048, 2025-05-07T20:31:41.2537897Z D=5120, 2025-05-07T20:31:41.2537980Z scale_ub=1200.0, 2025-05-07T20:31:41.2538063Z contiguous=False, 2025-05-07T20:31:41.2538141Z compiled=True, 2025-05-07T20:31:41.2538216Z ) 2025-05-07T20:31:41.2538645Z self = 2025-05-07T20:31:41.2538893Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2538901Z 2025-05-07T20:31:41.2538979Z @given( 2025-05-07T20:31:41.2539096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2539193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2539307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2539426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2539546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2539618Z ) 2025-05-07T20:31:41.2539859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2539951Z def test_silu_mul_quant( 2025-05-07T20:31:41.2540023Z self, 2025-05-07T20:31:41.2540095Z T: int, 2025-05-07T20:31:41.2540169Z D: int, 2025-05-07T20:31:41.2540263Z scale_ub: Optional[float], 2025-05-07T20:31:41.2540348Z contiguous: bool, 2025-05-07T20:31:41.2540431Z compiled: bool, 2025-05-07T20:31:41.2540506Z ) -> None: 2025-05-07T20:31:41.2540598Z torch.manual_seed(2025) 2025-05-07T20:31:41.2540670Z 2025-05-07T20:31:41.2540836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2540909Z 2025-05-07T20:31:41.2540996Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2541298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2541395Z x = x_sign * x_clamp 2025-05-07T20:31:41.2541472Z x0 = x[:, :D] 2025-05-07T20:31:41.2541548Z x1 = x[:, D:] 2025-05-07T20:31:41.2541622Z 2025-05-07T20:31:41.2541702Z if contiguous: 2025-05-07T20:31:41.2541791Z x0 = x0.contiguous() 2025-05-07T20:31:41.2541876Z x1 = x1.contiguous() 2025-05-07T20:31:41.2541947Z 2025-05-07T20:31:41.2542033Z if scale_ub is not None: 2025-05-07T20:31:41.2542142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2542275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2542356Z ) 2025-05-07T20:31:41.2542428Z else: 2025-05-07T20:31:41.2542518Z scale_ub_tensor = None 2025-05-07T20:31:41.2542589Z 2025-05-07T20:31:41.2542717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2542802Z op = silu_mul_quant 2025-05-07T20:31:41.2542895Z if compiled: 2025-05-07T20:31:41.2542996Z op = torch.compile(op) 2025-05-07T20:31:41.2543097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2543168Z 2025-05-07T20:31:41.2543256Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2543261Z 2025-05-07T20:31:41.2543354Z moe/activation_test.py:117: 2025-05-07T20:31:41.2543481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2543579Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2543678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2544045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2544135Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2544631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2544724Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2545210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2545438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2545775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2545870Z kernel = self.compile( 2025-05-07T20:31:41.2546252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2546422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2546549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2546553Z 2025-05-07T20:31:41.2546754Z self = 2025-05-07T20:31:41.2547530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2548031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb98a0>} 2025-05-07T20:31:41.2548778Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2548966Z context = 2025-05-07T20:31:41.2548971Z 2025-05-07T20:31:41.2549131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2549398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2549581Z module_map=module_map) 2025-05-07T20:31:41.2549748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2549843Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2549915Z E ^ 2025-05-07T20:31:41.2550270Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2550275Z 2025-05-07T20:31:41.2550685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2550690Z 2025-05-07T20:31:41.2550786Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2551007Z self=, 2025-05-07T20:31:41.2551080Z T=4096, 2025-05-07T20:31:41.2551151Z D=5120, 2025-05-07T20:31:41.2551232Z scale_ub=1200.0, 2025-05-07T20:31:41.2551308Z contiguous=True, 2025-05-07T20:31:41.2551394Z compiled=True, 2025-05-07T20:31:41.2551463Z ) 2025-05-07T20:31:41.2551687Z self = 2025-05-07T20:31:41.2551859Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2551864Z 2025-05-07T20:31:41.2551936Z @given( 2025-05-07T20:31:41.2552049Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2552147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2552259Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2552370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2552482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2552555Z ) 2025-05-07T20:31:41.2552802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2552892Z def test_silu_mul_quant( 2025-05-07T20:31:41.2552964Z self, 2025-05-07T20:31:41.2553039Z T: int, 2025-05-07T20:31:41.2553111Z D: int, 2025-05-07T20:31:41.2553209Z scale_ub: Optional[float], 2025-05-07T20:31:41.2553378Z contiguous: bool, 2025-05-07T20:31:41.2553457Z compiled: bool, 2025-05-07T20:31:41.2553533Z ) -> None: 2025-05-07T20:31:41.2553625Z torch.manual_seed(2025) 2025-05-07T20:31:41.2553692Z 2025-05-07T20:31:41.2553862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2553931Z 2025-05-07T20:31:41.2554017Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2554140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2554224Z x = x_sign * x_clamp 2025-05-07T20:31:41.2554298Z x0 = x[:, :D] 2025-05-07T20:31:41.2554374Z x1 = x[:, D:] 2025-05-07T20:31:41.2554440Z 2025-05-07T20:31:41.2554517Z if contiguous: 2025-05-07T20:31:41.2554606Z x0 = x0.contiguous() 2025-05-07T20:31:41.2554691Z x1 = x1.contiguous() 2025-05-07T20:31:41.2554762Z 2025-05-07T20:31:41.2554851Z if scale_ub is not None: 2025-05-07T20:31:41.2554959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2555092Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2555165Z ) 2025-05-07T20:31:41.2555237Z else: 2025-05-07T20:31:41.2555332Z scale_ub_tensor = None 2025-05-07T20:31:41.2555400Z 2025-05-07T20:31:41.2555528Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2555615Z op = silu_mul_quant 2025-05-07T20:31:41.2555693Z if compiled: 2025-05-07T20:31:41.2555785Z op = torch.compile(op) 2025-05-07T20:31:41.2555886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2555952Z 2025-05-07T20:31:41.2556039Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2556043Z 2025-05-07T20:31:41.2556135Z moe/activation_test.py:117: 2025-05-07T20:31:41.2556258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2556434Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2556536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2556903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2556994Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2557487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2557584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2557939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2558159Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2558498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2558588Z kernel = self.compile( 2025-05-07T20:31:41.2558976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2559157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2559280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2559285Z 2025-05-07T20:31:41.2559490Z self = 2025-05-07T20:31:41.2560256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2560751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdbaac0>} 2025-05-07T20:31:41.2561508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2561882Z context = 2025-05-07T20:31:41.2561887Z 2025-05-07T20:31:41.2562051Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2562309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2562413Z module_map=module_map) 2025-05-07T20:31:41.2562569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2562662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2562741Z E ^ 2025-05-07T20:31:41.2563091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2563096Z 2025-05-07T20:31:41.2563643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2563654Z 2025-05-07T20:31:41.2563754Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2563971Z self=, 2025-05-07T20:31:41.2564042Z T=128, 2025-05-07T20:31:41.2564113Z D=5120, 2025-05-07T20:31:41.2564191Z scale_ub=1200.0, 2025-05-07T20:31:41.2564276Z contiguous=False, 2025-05-07T20:31:41.2564351Z compiled=True, 2025-05-07T20:31:41.2564421Z ) 2025-05-07T20:31:41.2564640Z self = 2025-05-07T20:31:41.2564807Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2564812Z 2025-05-07T20:31:41.2564884Z @given( 2025-05-07T20:31:41.2565001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2565096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2565289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2565405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2565513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2565587Z ) 2025-05-07T20:31:41.2565827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2565913Z def test_silu_mul_quant( 2025-05-07T20:31:41.2565990Z self, 2025-05-07T20:31:41.2566061Z T: int, 2025-05-07T20:31:41.2566131Z D: int, 2025-05-07T20:31:41.2566226Z scale_ub: Optional[float], 2025-05-07T20:31:41.2566311Z contiguous: bool, 2025-05-07T20:31:41.2566391Z compiled: bool, 2025-05-07T20:31:41.2566469Z ) -> None: 2025-05-07T20:31:41.2566558Z torch.manual_seed(2025) 2025-05-07T20:31:41.2566629Z 2025-05-07T20:31:41.2566794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2566863Z 2025-05-07T20:31:41.2566959Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2567084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2567166Z x = x_sign * x_clamp 2025-05-07T20:31:41.2567241Z x0 = x[:, :D] 2025-05-07T20:31:41.2567315Z x1 = x[:, D:] 2025-05-07T20:31:41.2567381Z 2025-05-07T20:31:41.2567463Z if contiguous: 2025-05-07T20:31:41.2567547Z x0 = x0.contiguous() 2025-05-07T20:31:41.2567630Z x1 = x1.contiguous() 2025-05-07T20:31:41.2567704Z 2025-05-07T20:31:41.2567790Z if scale_ub is not None: 2025-05-07T20:31:41.2567895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2568024Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2568095Z ) 2025-05-07T20:31:41.2568171Z else: 2025-05-07T20:31:41.2568259Z scale_ub_tensor = None 2025-05-07T20:31:41.2568328Z 2025-05-07T20:31:41.2568454Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2568545Z op = silu_mul_quant 2025-05-07T20:31:41.2568708Z if compiled: 2025-05-07T20:31:41.2568805Z op = torch.compile(op) 2025-05-07T20:31:41.2568905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2568970Z 2025-05-07T20:31:41.2569057Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2569061Z 2025-05-07T20:31:41.2569152Z moe/activation_test.py:117: 2025-05-07T20:31:41.2569280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2569376Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2569471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2569843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2569931Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2570429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2570534Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2570891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2571114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2571451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2571539Z kernel = self.compile( 2025-05-07T20:31:41.2571923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2572093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2572214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2572223Z 2025-05-07T20:31:41.2572427Z self = 2025-05-07T20:31:41.2573276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2573778Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80c540>} 2025-05-07T20:31:41.2574524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2574710Z context = 2025-05-07T20:31:41.2574715Z 2025-05-07T20:31:41.2574874Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2575138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2575246Z module_map=module_map) 2025-05-07T20:31:41.2575402Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2575498Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2575570Z E ^ 2025-05-07T20:31:41.2575919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2575923Z 2025-05-07T20:31:41.2576337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2576341Z 2025-05-07T20:31:41.2576439Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2576659Z self=, 2025-05-07T20:31:41.2576734Z T=16384, 2025-05-07T20:31:41.2576805Z D=7168, 2025-05-07T20:31:41.2576884Z scale_ub=1200.0, 2025-05-07T20:31:41.2576963Z contiguous=True, 2025-05-07T20:31:41.2577045Z compiled=True, 2025-05-07T20:31:41.2577197Z ) 2025-05-07T20:31:41.2577410Z self = 2025-05-07T20:31:41.2577581Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2577586Z 2025-05-07T20:31:41.2577666Z @given( 2025-05-07T20:31:41.2577778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2577871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2577984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2578095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2578207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2578274Z ) 2025-05-07T20:31:41.2578515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2578606Z def test_silu_mul_quant( 2025-05-07T20:31:41.2578678Z self, 2025-05-07T20:31:41.2578748Z T: int, 2025-05-07T20:31:41.2578833Z D: int, 2025-05-07T20:31:41.2578925Z scale_ub: Optional[float], 2025-05-07T20:31:41.2579008Z contiguous: bool, 2025-05-07T20:31:41.2579092Z compiled: bool, 2025-05-07T20:31:41.2579162Z ) -> None: 2025-05-07T20:31:41.2579249Z torch.manual_seed(2025) 2025-05-07T20:31:41.2579321Z 2025-05-07T20:31:41.2579484Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2579557Z 2025-05-07T20:31:41.2579642Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2579760Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2579846Z x = x_sign * x_clamp 2025-05-07T20:31:41.2579920Z x0 = x[:, :D] 2025-05-07T20:31:41.2579994Z x1 = x[:, D:] 2025-05-07T20:31:41.2580067Z 2025-05-07T20:31:41.2580143Z if contiguous: 2025-05-07T20:31:41.2580228Z x0 = x0.contiguous() 2025-05-07T20:31:41.2580312Z x1 = x1.contiguous() 2025-05-07T20:31:41.2581050Z 2025-05-07T20:31:41.2581150Z if scale_ub is not None: 2025-05-07T20:31:41.2581254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2581385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2581456Z ) 2025-05-07T20:31:41.2581528Z else: 2025-05-07T20:31:41.2581616Z scale_ub_tensor = None 2025-05-07T20:31:41.2581687Z 2025-05-07T20:31:41.2581813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2581895Z op = silu_mul_quant 2025-05-07T20:31:41.2581977Z if compiled: 2025-05-07T20:31:41.2582071Z op = torch.compile(op) 2025-05-07T20:31:41.2582172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2582242Z 2025-05-07T20:31:41.2582328Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2582333Z 2025-05-07T20:31:41.2582427Z moe/activation_test.py:117: 2025-05-07T20:31:41.2582559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2582658Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2582755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2583123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2583209Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2583704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2583795Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2584149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2584373Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2584711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2584812Z kernel = self.compile( 2025-05-07T20:31:41.2585273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2585443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2585569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2585573Z 2025-05-07T20:31:41.2585777Z self = 2025-05-07T20:31:41.2586545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2587044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80d080>} 2025-05-07T20:31:41.2587794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2587993Z context = 2025-05-07T20:31:41.2587998Z 2025-05-07T20:31:41.2588159Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2588420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2588521Z module_map=module_map) 2025-05-07T20:31:41.2588679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2588776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2588845Z E ^ 2025-05-07T20:31:41.2589195Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2589200Z 2025-05-07T20:31:41.2589717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2589722Z 2025-05-07T20:31:41.2589822Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2590042Z self=, 2025-05-07T20:31:41.2590115Z T=16384, 2025-05-07T20:31:41.2590185Z D=5120, 2025-05-07T20:31:41.2590274Z scale_ub=1200.0, 2025-05-07T20:31:41.2590355Z contiguous=True, 2025-05-07T20:31:41.2590431Z compiled=False, 2025-05-07T20:31:41.2590503Z ) 2025-05-07T20:31:41.2590716Z self = 2025-05-07T20:31:41.2590889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2590893Z 2025-05-07T20:31:41.2590966Z @given( 2025-05-07T20:31:41.2591080Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2591181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2591294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2591404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2591516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2591584Z ) 2025-05-07T20:31:41.2591824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2591915Z def test_silu_mul_quant( 2025-05-07T20:31:41.2591987Z self, 2025-05-07T20:31:41.2592055Z T: int, 2025-05-07T20:31:41.2592129Z D: int, 2025-05-07T20:31:41.2592220Z scale_ub: Optional[float], 2025-05-07T20:31:41.2592306Z contiguous: bool, 2025-05-07T20:31:41.2592385Z compiled: bool, 2025-05-07T20:31:41.2592460Z ) -> None: 2025-05-07T20:31:41.2592552Z torch.manual_seed(2025) 2025-05-07T20:31:41.2592620Z 2025-05-07T20:31:41.2592787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2592946Z 2025-05-07T20:31:41.2593033Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2593152Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2593238Z x = x_sign * x_clamp 2025-05-07T20:31:41.2593312Z x0 = x[:, :D] 2025-05-07T20:31:41.2593386Z x1 = x[:, D:] 2025-05-07T20:31:41.2593456Z 2025-05-07T20:31:41.2593533Z if contiguous: 2025-05-07T20:31:41.2593617Z x0 = x0.contiguous() 2025-05-07T20:31:41.2593708Z x1 = x1.contiguous() 2025-05-07T20:31:41.2593776Z 2025-05-07T20:31:41.2593864Z if scale_ub is not None: 2025-05-07T20:31:41.2593964Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2594094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2594168Z ) 2025-05-07T20:31:41.2594237Z else: 2025-05-07T20:31:41.2594324Z scale_ub_tensor = None 2025-05-07T20:31:41.2594396Z 2025-05-07T20:31:41.2594526Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2594614Z op = silu_mul_quant 2025-05-07T20:31:41.2594697Z if compiled: 2025-05-07T20:31:41.2594790Z op = torch.compile(op) 2025-05-07T20:31:41.2594889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2594960Z 2025-05-07T20:31:41.2595044Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2595048Z 2025-05-07T20:31:41.2595142Z moe/activation_test.py:117: 2025-05-07T20:31:41.2595264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2595359Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2595453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2595953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:41.2596046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2596488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2596716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2597056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2597145Z kernel = self.compile( 2025-05-07T20:31:41.2597576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2597749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2597871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2597876Z 2025-05-07T20:31:41.2598079Z self = 2025-05-07T20:31:41.2598851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2599348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80e660>} 2025-05-07T20:31:41.2600099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2600285Z context = 2025-05-07T20:31:41.2600289Z 2025-05-07T20:31:41.2600452Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2600709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2600811Z module_map=module_map) 2025-05-07T20:31:41.2600974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2601145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2601221Z E ^ 2025-05-07T20:31:41.2601574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2601578Z 2025-05-07T20:31:41.2601989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2601993Z 2025-05-07T20:31:41.2602094Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2602313Z self=, 2025-05-07T20:31:41.2602385Z T=1, 2025-05-07T20:31:41.2602459Z D=7168, 2025-05-07T20:31:41.2602536Z scale_ub=1200.0, 2025-05-07T20:31:41.2602622Z contiguous=False, 2025-05-07T20:31:41.2602702Z compiled=False, 2025-05-07T20:31:41.2602769Z ) 2025-05-07T20:31:41.2602989Z self = 2025-05-07T20:31:41.2603158Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2603163Z 2025-05-07T20:31:41.2603235Z @given( 2025-05-07T20:31:41.2603481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2603575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2603682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2603795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2603903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2603976Z ) 2025-05-07T20:31:41.2604214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2604301Z def test_silu_mul_quant( 2025-05-07T20:31:41.2604375Z self, 2025-05-07T20:31:41.2604447Z T: int, 2025-05-07T20:31:41.2604515Z D: int, 2025-05-07T20:31:41.2604610Z scale_ub: Optional[float], 2025-05-07T20:31:41.2604775Z contiguous: bool, 2025-05-07T20:31:41.2604862Z compiled: bool, 2025-05-07T20:31:41.2604935Z ) -> None: 2025-05-07T20:31:41.2605024Z torch.manual_seed(2025) 2025-05-07T20:31:41.2605094Z 2025-05-07T20:31:41.2605262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2605332Z 2025-05-07T20:31:41.2605421Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2605540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2605621Z x = x_sign * x_clamp 2025-05-07T20:31:41.2605697Z x0 = x[:, :D] 2025-05-07T20:31:41.2605770Z x1 = x[:, D:] 2025-05-07T20:31:41.2605838Z 2025-05-07T20:31:41.2605921Z if contiguous: 2025-05-07T20:31:41.2606008Z x0 = x0.contiguous() 2025-05-07T20:31:41.2606092Z x1 = x1.contiguous() 2025-05-07T20:31:41.2606159Z 2025-05-07T20:31:41.2606244Z if scale_ub is not None: 2025-05-07T20:31:41.2606351Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2606486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2606557Z ) 2025-05-07T20:31:41.2606628Z else: 2025-05-07T20:31:41.2606718Z scale_ub_tensor = None 2025-05-07T20:31:41.2606785Z 2025-05-07T20:31:41.2606913Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2606995Z op = silu_mul_quant 2025-05-07T20:31:41.2607073Z if compiled: 2025-05-07T20:31:41.2607172Z op = torch.compile(op) 2025-05-07T20:31:41.2607292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2607367Z 2025-05-07T20:31:41.2607474Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2607479Z 2025-05-07T20:31:41.2607576Z moe/activation_test.py:117: 2025-05-07T20:31:41.2607698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2607796Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2607895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2608479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2608571Z 
2025-05-07T20:31:41.2608571Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:41.2613980Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") (same jit.py:330 -> jit.py:623 -> compiler.py:273 -> compiler.py:100 frames as above)
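Every CompilationError in this run bottoms out in the same ValueError: Triton only lowers fp8e4nv (PyTorch's float8_e4m3fn) on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner reports SM 8.6. A minimal guard along these lines would skip rather than fail the test on such runners; this is a sketch, not FBGEMM's actual test code, and the helper and class names are invented:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv / float8_e4m3fn needs SM >= 8.9 (Ada or Hopper); the A10G
        # here reports (8, 6), which is why every Triton compile above fails.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU architecture")
    class SiluMulQuantTest(unittest.TestCase):
        ...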
2025-05-07T20:31:41.2614501Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError, reached through an extra torch/_dynamo/eval_frame.py:678 frame from torch.compile: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2627224Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2640192Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError (fp8e4nv not supported)
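For orientation while reading these failures: from the call site, op(x0, x1, scale_ub_tensor) returns a quantized tensor plus a scale, so the operator under test fuses a SiLU gate with fp8 quantization. The following eager sketch of plausible semantics is inferred from the test alone and assumes rowwise float8_e4m3fn quantization; it is not FBGEMM's implementation, and silu_mul_quant_ref is an invented name:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y = torch.nn.functional.silu(x0.float()) * x1.float()  # SiLU gate
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the scale
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)  # rowwise quantize
        return y_fp8, y_scale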
2025-05-07T20:31:41.2658454Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:41.2661850Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:41.2663665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:41.2663791Z moe/activation_test.py:95: OutOfMemoryError
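The sizes in these OutOfMemoryError messages are exactly the size of one [T, 2 * D] bfloat16 input tensor from the test body (2 bytes per element), which is what fails to allocate; the roughly 21.9 GiB already in use is what actually fills the card. A quick check of the reported numbers:

    def input_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element
        return T * 2 * D * 2 / 2**20

    assert input_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert input_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert input_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert input_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"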
2025-05-07T20:31:41.2663896Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 112.00 MiB with 32.44 MiB free
2025-05-07T20:31:41.2669292Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 144.44 MiB free
2025-05-07T20:31:41.2674479Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free
2025-05-07T20:31:41.2679866Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 32.44 MiB free
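Free memory hovers between roughly 30 and 145 MiB across these examples, meaning the GPU stays essentially full from one Hypothesis example to the next, so allocations from earlier examples are plainly not being released. One plausible mitigation, sketched here with an invented helper name rather than taken from the log, is an explicit cleanup between examples, combined with the allocator setting the error message itself suggests (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which must be set before CUDA is first initialized):

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        gc.collect()               # drop unreachable tensors first
        torch.cuda.empty_cache()   # return cached blocks to the driver
        torch.cuda.synchronize()   # make sure pending frees have landed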
2025-05-07T20:31:41.2685160Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2697648Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2709911Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2722190Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB with 30.44 MiB free
2025-05-07T20:31:41.2727350Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2739925Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB with 30.44 MiB free
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2745164Z 2025-05-07T20:31:41.2745279Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:41.2745284Z 2025-05-07T20:31:41.2745381Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2745605Z self=, 2025-05-07T20:31:41.2745679Z T=16384, 2025-05-07T20:31:41.2745750Z D=5120, 2025-05-07T20:31:41.2745829Z scale_ub=None, 2025-05-07T20:31:41.2745908Z contiguous=True, 2025-05-07T20:31:41.2745986Z compiled=False, 2025-05-07T20:31:41.2746055Z ) 2025-05-07T20:31:41.2746269Z self = 2025-05-07T20:31:41.2746445Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2746450Z 2025-05-07T20:31:41.2746525Z @given( 2025-05-07T20:31:41.2746644Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2746746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2746854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2746965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2747075Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2747144Z ) 2025-05-07T20:31:41.2747384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2747474Z def test_silu_mul_quant( 2025-05-07T20:31:41.2747547Z self, 2025-05-07T20:31:41.2747620Z T: int, 2025-05-07T20:31:41.2747693Z D: int, 2025-05-07T20:31:41.2747784Z scale_ub: Optional[float], 2025-05-07T20:31:41.2747870Z contiguous: bool, 2025-05-07T20:31:41.2747951Z compiled: bool, 2025-05-07T20:31:41.2748022Z ) -> None: 2025-05-07T20:31:41.2748115Z torch.manual_seed(2025) 2025-05-07T20:31:41.2748184Z 2025-05-07T20:31:41.2748353Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2750237Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2750243Z 2025-05-07T20:31:41.2750357Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2750362Z 2025-05-07T20:31:41.2750459Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2750676Z self=, 2025-05-07T20:31:41.2750758Z T=4096, 2025-05-07T20:31:41.2750836Z D=5120, 2025-05-07T20:31:41.2750913Z scale_ub=None, 2025-05-07T20:31:41.2750993Z contiguous=True, 2025-05-07T20:31:41.2751073Z compiled=False, 2025-05-07T20:31:41.2751141Z ) 2025-05-07T20:31:41.2751355Z self = 2025-05-07T20:31:41.2751520Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2751525Z 2025-05-07T20:31:41.2751597Z @given( 2025-05-07T20:31:41.2751711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2751806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2751916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2752027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2752135Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2752207Z ) 2025-05-07T20:31:41.2752525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2752621Z def test_silu_mul_quant( 2025-05-07T20:31:41.2752696Z self, 2025-05-07T20:31:41.2752766Z T: int, 2025-05-07T20:31:41.2752834Z D: int, 2025-05-07T20:31:41.2752931Z scale_ub: Optional[float], 2025-05-07T20:31:41.2753014Z contiguous: bool, 2025-05-07T20:31:41.2753096Z compiled: bool, 2025-05-07T20:31:41.2753174Z ) -> None: 2025-05-07T20:31:41.2753261Z torch.manual_seed(2025) 2025-05-07T20:31:41.2753334Z 2025-05-07T20:31:41.2753497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2755269Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
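The OutOfMemoryError examples in this stretch are secondary failures: every message shows the process already holding 22.03 GiB of the card's 22.07 GiB, with 21.73 GiB of live PyTorch allocations carried over from earlier examples, so even a 40 MiB request fails at torch.randn. A hedged sketch of per-example cleanup (illustrative, not taken from activation_test.py) that returns cached blocks before each new example:

import gc

import torch


def release_cuda_memory() -> None:
    gc.collect()              # drop Python references to dead tensors first
    torch.cuda.synchronize()  # let in-flight kernels finish
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver


# e.g. call at the top of test_silu_mul_quant, before the first torch.randn
release_cuda_memory()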
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2755283Z 2025-05-07T20:31:41.2755398Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2755403Z 2025-05-07T20:31:41.2755497Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2755718Z self=, 2025-05-07T20:31:41.2755788Z T=2048, 2025-05-07T20:31:41.2755860Z D=5120, 2025-05-07T20:31:41.2755940Z scale_ub=None, 2025-05-07T20:31:41.2756021Z contiguous=False, 2025-05-07T20:31:41.2756097Z compiled=False, 2025-05-07T20:31:41.2756172Z ) 2025-05-07T20:31:41.2756384Z self = 2025-05-07T20:31:41.2756557Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:41.2756562Z 2025-05-07T20:31:41.2756722Z @given( 2025-05-07T20:31:41.2756834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2756934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2757041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2757151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2757261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2757330Z ) 2025-05-07T20:31:41.2757568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2757658Z def test_silu_mul_quant( 2025-05-07T20:31:41.2757730Z self, 2025-05-07T20:31:41.2757802Z T: int, 2025-05-07T20:31:41.2757870Z D: int, 2025-05-07T20:31:41.2757963Z scale_ub: Optional[float], 2025-05-07T20:31:41.2758049Z contiguous: bool, 2025-05-07T20:31:41.2758128Z compiled: bool, 2025-05-07T20:31:41.2758202Z ) -> None: 2025-05-07T20:31:41.2758299Z torch.manual_seed(2025) 2025-05-07T20:31:41.2758372Z 2025-05-07T20:31:41.2758532Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2760303Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2760309Z 2025-05-07T20:31:41.2760424Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2760429Z 2025-05-07T20:31:41.2760527Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2760820Z self=, 2025-05-07T20:31:41.2760902Z T=4096, 2025-05-07T20:31:41.2760973Z D=7168, 2025-05-07T20:31:41.2761050Z scale_ub=None, 2025-05-07T20:31:41.2761130Z contiguous=True, 2025-05-07T20:31:41.2761206Z compiled=True, 2025-05-07T20:31:41.2761276Z ) 2025-05-07T20:31:41.2761492Z self = 2025-05-07T20:31:41.2761657Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2761661Z 2025-05-07T20:31:41.2761735Z @given( 2025-05-07T20:31:41.2761850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2761943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2762055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2762166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2762272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2762345Z ) 2025-05-07T20:31:41.2762590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2762680Z def test_silu_mul_quant( 2025-05-07T20:31:41.2762757Z self, 2025-05-07T20:31:41.2762830Z T: int, 2025-05-07T20:31:41.2762900Z D: int, 2025-05-07T20:31:41.2762997Z scale_ub: Optional[float], 2025-05-07T20:31:41.2763080Z contiguous: bool, 2025-05-07T20:31:41.2763158Z compiled: bool, 2025-05-07T20:31:41.2763232Z ) -> None: 2025-05-07T20:31:41.2763454Z torch.manual_seed(2025) 2025-05-07T20:31:41.2763526Z 2025-05-07T20:31:41.2763688Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2765467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2765562Z 2025-05-07T20:31:41.2765676Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2765680Z 2025-05-07T20:31:41.2765776Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2765996Z self=, 2025-05-07T20:31:41.2766070Z T=2048, 2025-05-07T20:31:41.2766141Z D=5120, 2025-05-07T20:31:41.2766221Z scale_ub=1200.0, 2025-05-07T20:31:41.2766303Z contiguous=False, 2025-05-07T20:31:41.2766380Z compiled=False, 2025-05-07T20:31:41.2766454Z ) 2025-05-07T20:31:41.2766667Z self = 2025-05-07T20:31:41.2766845Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2766858Z 2025-05-07T20:31:41.2766931Z @given( 2025-05-07T20:31:41.2767043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2767140Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2767246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2767355Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2767465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2767535Z ) 2025-05-07T20:31:41.2767775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2767866Z def test_silu_mul_quant( 2025-05-07T20:31:41.2767936Z self, 2025-05-07T20:31:41.2768011Z T: int, 2025-05-07T20:31:41.2768083Z D: int, 2025-05-07T20:31:41.2768174Z scale_ub: Optional[float], 2025-05-07T20:31:41.2768259Z contiguous: bool, 2025-05-07T20:31:41.2768338Z compiled: bool, 2025-05-07T20:31:41.2768492Z ) -> None: 2025-05-07T20:31:41.2768591Z torch.manual_seed(2025) 2025-05-07T20:31:41.2768660Z 2025-05-07T20:31:41.2768822Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2770587Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
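Each OOM message recommends PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but that advice targets fragmentation, and here "reserved by PyTorch but unallocated" is only about 14 MiB, so the setting would likely not rescue this run. For reference, it must be in the environment before the CUDA caching allocator initializes; a sketch:

import os

# Must be set before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the variable so the allocator picks it up

x = torch.empty(1024, device="cuda")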
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2770592Z 2025-05-07T20:31:41.2770702Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2770707Z 2025-05-07T20:31:41.2770810Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2771030Z self=, 2025-05-07T20:31:41.2771107Z T=4096, 2025-05-07T20:31:41.2771180Z D=7168, 2025-05-07T20:31:41.2771256Z scale_ub=1200.0, 2025-05-07T20:31:41.2771337Z contiguous=True, 2025-05-07T20:31:41.2771414Z compiled=False, 2025-05-07T20:31:41.2771483Z ) 2025-05-07T20:31:41.2771702Z self = 2025-05-07T20:31:41.2771872Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2771877Z 2025-05-07T20:31:41.2771949Z @given( 2025-05-07T20:31:41.2772065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2772159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2777719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2777865Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2777987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2778170Z ) 2025-05-07T20:31:41.2778417Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2778508Z def test_silu_mul_quant( 2025-05-07T20:31:41.2778586Z self, 2025-05-07T20:31:41.2778657Z T: int, 2025-05-07T20:31:41.2778727Z D: int, 2025-05-07T20:31:41.2778821Z scale_ub: Optional[float], 2025-05-07T20:31:41.2778906Z contiguous: bool, 2025-05-07T20:31:41.2778985Z compiled: bool, 2025-05-07T20:31:41.2779061Z ) -> None: 2025-05-07T20:31:41.2779151Z torch.manual_seed(2025) 2025-05-07T20:31:41.2779225Z 2025-05-07T20:31:41.2779393Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2781182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2781198Z 2025-05-07T20:31:41.2781313Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2781319Z 2025-05-07T20:31:41.2781415Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2781641Z self=, 2025-05-07T20:31:41.2781715Z T=16384, 2025-05-07T20:31:41.2781793Z D=7168, 2025-05-07T20:31:41.2781870Z scale_ub=None, 2025-05-07T20:31:41.2781952Z contiguous=False, 2025-05-07T20:31:41.2782029Z compiled=True, 2025-05-07T20:31:41.2782102Z ) 2025-05-07T20:31:41.2782395Z self = 2025-05-07T20:31:41.2782577Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2782583Z 2025-05-07T20:31:41.2782657Z @given( 2025-05-07T20:31:41.2782773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2782873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2782985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2783096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2783207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2783276Z ) 2025-05-07T20:31:41.2783522Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2783609Z def test_silu_mul_quant( 2025-05-07T20:31:41.2783681Z self, 2025-05-07T20:31:41.2783757Z T: int, 2025-05-07T20:31:41.2783831Z D: int, 2025-05-07T20:31:41.2783925Z scale_ub: Optional[float], 2025-05-07T20:31:41.2784017Z contiguous: bool, 2025-05-07T20:31:41.2784100Z compiled: bool, 2025-05-07T20:31:41.2784172Z ) -> None: 2025-05-07T20:31:41.2784274Z torch.manual_seed(2025) 2025-05-07T20:31:41.2784345Z 2025-05-07T20:31:41.2784509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2786282Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
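The requested sizes are consistent with the shapes in the test: a bfloat16 tensor of shape [T, 2*D] occupies T * 2D * 2 bytes, which reproduces the 40, 112, 320, and 448 MiB figures in these messages exactly:

def bf16_mib(T: int, D: int) -> float:
    # Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
    return T * 2 * D * 2 / 2**20


assert bf16_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"
assert bf16_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert bf16_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
assert bf16_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"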
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2786287Z 2025-05-07T20:31:41.2786401Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2786497Z 2025-05-07T20:31:41.2786602Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2786822Z self=, 2025-05-07T20:31:41.2786902Z T=4096, 2025-05-07T20:31:41.2786977Z D=7168, 2025-05-07T20:31:41.2787055Z scale_ub=None, 2025-05-07T20:31:41.2787138Z contiguous=True, 2025-05-07T20:31:41.2787217Z compiled=False, 2025-05-07T20:31:41.2787286Z ) 2025-05-07T20:31:41.2787500Z self = 2025-05-07T20:31:41.2787667Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2787671Z 2025-05-07T20:31:41.2787744Z @given( 2025-05-07T20:31:41.2787860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2787954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2788066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2788182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2788296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2788367Z ) 2025-05-07T20:31:41.2788606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2788694Z def test_silu_mul_quant( 2025-05-07T20:31:41.2788769Z self, 2025-05-07T20:31:41.2788839Z T: int, 2025-05-07T20:31:41.2788907Z D: int, 2025-05-07T20:31:41.2789003Z scale_ub: Optional[float], 2025-05-07T20:31:41.2789087Z contiguous: bool, 2025-05-07T20:31:41.2789165Z compiled: bool, 2025-05-07T20:31:41.2789241Z ) -> None: 2025-05-07T20:31:41.2789329Z torch.manual_seed(2025) 2025-05-07T20:31:41.2789402Z 2025-05-07T20:31:41.2789564Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2791414Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2791431Z 2025-05-07T20:31:41.2791547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2791551Z 2025-05-07T20:31:41.2791649Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2791869Z self=, 2025-05-07T20:31:41.2791945Z T=16384, 2025-05-07T20:31:41.2792017Z D=7168, 2025-05-07T20:31:41.2792098Z scale_ub=None, 2025-05-07T20:31:41.2792179Z contiguous=True, 2025-05-07T20:31:41.2792260Z compiled=False, 2025-05-07T20:31:41.2792333Z ) 2025-05-07T20:31:41.2792556Z self = 2025-05-07T20:31:41.2792730Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2792735Z 2025-05-07T20:31:41.2792809Z @given( 2025-05-07T20:31:41.2792922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2793018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2793126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2793236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2793347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2793416Z ) 2025-05-07T20:31:41.2793663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2793754Z def test_silu_mul_quant( 2025-05-07T20:31:41.2793831Z self, 2025-05-07T20:31:41.2793906Z T: int, 2025-05-07T20:31:41.2793976Z D: int, 2025-05-07T20:31:41.2794073Z scale_ub: Optional[float], 2025-05-07T20:31:41.2794239Z contiguous: bool, 2025-05-07T20:31:41.2794320Z compiled: bool, 2025-05-07T20:31:41.2794390Z ) -> None: 2025-05-07T20:31:41.2794484Z torch.manual_seed(2025) 2025-05-07T20:31:41.2794552Z 2025-05-07T20:31:41.2794716Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2796487Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2796499Z 2025-05-07T20:31:41.2796615Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2796623Z 2025-05-07T20:31:41.2796721Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2796940Z self=, 2025-05-07T20:31:41.2797019Z T=16384, 2025-05-07T20:31:41.2797094Z D=7168, 2025-05-07T20:31:41.2797173Z scale_ub=1200.0, 2025-05-07T20:31:41.2797258Z contiguous=True, 2025-05-07T20:31:41.2797338Z compiled=False, 2025-05-07T20:31:41.2797408Z ) 2025-05-07T20:31:41.2797622Z self = 2025-05-07T20:31:41.2797794Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2797798Z 2025-05-07T20:31:41.2797872Z @given( 2025-05-07T20:31:41.2797991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2798085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2798275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2798398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2798508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2798584Z ) 2025-05-07T20:31:41.2798824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2798911Z def test_silu_mul_quant( 2025-05-07T20:31:41.2798990Z self, 2025-05-07T20:31:41.2799059Z T: int, 2025-05-07T20:31:41.2799133Z D: int, 2025-05-07T20:31:41.2799226Z scale_ub: Optional[float], 2025-05-07T20:31:41.2799309Z contiguous: bool, 2025-05-07T20:31:41.2799394Z compiled: bool, 2025-05-07T20:31:41.2799468Z ) -> None: 2025-05-07T20:31:41.2799560Z torch.manual_seed(2025) 2025-05-07T20:31:41.2799630Z 2025-05-07T20:31:41.2799794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2801568Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
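The @given/@settings pair repeated in these tracebacks is standard Hypothesis usage: sampled_from enumerates a small parameter grid, and the per-test @settings layer on top of the runner's 'ci' profile (database=None, deadline=None, print_blob=True, derandomize=True, as the session header further below shows). A sketch of how such a profile could be registered in a conftest (illustrative; FBGEMM's actual conftest may differ):

from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,     # do not persist failing examples between runs
    deadline=None,     # GPU tests can be slow; disable the per-example deadline
    print_blob=True,   # print a reproduction blob for failures
    derandomize=True,  # deterministic example order on CI
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")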
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2801579Z 2025-05-07T20:31:41.2801690Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2801694Z 2025-05-07T20:31:41.2801794Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2802014Z self=, 2025-05-07T20:31:41.2802084Z T=128, 2025-05-07T20:31:41.2802162Z D=5120, 2025-05-07T20:31:41.2802239Z scale_ub=1200.0, 2025-05-07T20:31:41.2802321Z contiguous=False, 2025-05-07T20:31:41.2802488Z compiled=False, 2025-05-07T20:31:41.2802559Z ) 2025-05-07T20:31:41.2802770Z self = 2025-05-07T20:31:41.2802941Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2802946Z 2025-05-07T20:31:41.2803018Z @given( 2025-05-07T20:31:41.2803134Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2803225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2803440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2803557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2803666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2803735Z ) 2025-05-07T20:31:41.2803976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2804064Z def test_silu_mul_quant( 2025-05-07T20:31:41.2804137Z self, 2025-05-07T20:31:41.2804217Z T: int, 2025-05-07T20:31:41.2804295Z D: int, 2025-05-07T20:31:41.2804387Z scale_ub: Optional[float], 2025-05-07T20:31:41.2804473Z contiguous: bool, 2025-05-07T20:31:41.2804552Z compiled: bool, 2025-05-07T20:31:41.2804625Z ) -> None: 2025-05-07T20:31:41.2804715Z torch.manual_seed(2025) 2025-05-07T20:31:41.2804784Z 2025-05-07T20:31:41.2804949Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2805021Z 2025-05-07T20:31:41.2805107Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2805230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2805312Z x = x_sign * x_clamp 2025-05-07T20:31:41.2805385Z x0 = x[:, :D] 2025-05-07T20:31:41.2805463Z x1 = x[:, D:] 2025-05-07T20:31:41.2805534Z 2025-05-07T20:31:41.2805612Z if contiguous: 2025-05-07T20:31:41.2805702Z x0 = x0.contiguous() 2025-05-07T20:31:41.2805901Z x1 = x1.contiguous() 2025-05-07T20:31:41.2805975Z 2025-05-07T20:31:41.2806066Z if scale_ub is not None: 2025-05-07T20:31:41.2806166Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2806298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2806370Z ) 2025-05-07T20:31:41.2806441Z else: 2025-05-07T20:31:41.2806536Z scale_ub_tensor = None 2025-05-07T20:31:41.2806604Z 2025-05-07T20:31:41.2806733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2806819Z op = silu_mul_quant 2025-05-07T20:31:41.2806897Z if compiled: 2025-05-07T20:31:41.2806990Z op = torch.compile(op) 2025-05-07T20:31:41.2807098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2807164Z 2025-05-07T20:31:41.2807250Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2807258Z 2025-05-07T20:31:41.2807353Z moe/activation_test.py:117: 2025-05-07T20:31:41.2807484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2807586Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2807682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2808181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2808278Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2808637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2808859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2809200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2809290Z kernel = self.compile( 2025-05-07T20:31:41.2809676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2809936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2810060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2810065Z 2025-05-07T20:31:41.2810487Z self = 2025-05-07T20:31:41.2811262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2811764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150220>} 2025-05-07T20:31:41.2812519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2812716Z context = 2025-05-07T20:31:41.2812720Z 2025-05-07T20:31:41.2812881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2813144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2813249Z module_map=module_map) 2025-05-07T20:31:41.2813409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2813503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2813581Z E ^ 2025-05-07T20:31:41.2813934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2813939Z 2025-05-07T20:31:41.2814355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2814359Z 2025-05-07T20:31:41.2814545Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2814767Z self=, 2025-05-07T20:31:41.2814845Z T=2048, 2025-05-07T20:31:41.2814917Z D=7168, 2025-05-07T20:31:41.2814993Z scale_ub=None, 2025-05-07T20:31:41.2815078Z contiguous=False, 2025-05-07T20:31:41.2815158Z compiled=False, 2025-05-07T20:31:41.2815230Z ) 2025-05-07T20:31:41.2815443Z self = 2025-05-07T20:31:41.2815611Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:41.2815615Z 2025-05-07T20:31:41.2815692Z @given( 2025-05-07T20:31:41.2815806Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2815899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2816010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2816122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2816242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2816312Z ) 2025-05-07T20:31:41.2816551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2816643Z def test_silu_mul_quant( 2025-05-07T20:31:41.2816717Z self, 2025-05-07T20:31:41.2816787Z T: int, 2025-05-07T20:31:41.2816863Z D: int, 2025-05-07T20:31:41.2816956Z scale_ub: Optional[float], 2025-05-07T20:31:41.2817039Z contiguous: bool, 2025-05-07T20:31:41.2817122Z compiled: bool, 2025-05-07T20:31:41.2817194Z ) -> None: 2025-05-07T20:31:41.2817283Z torch.manual_seed(2025) 2025-05-07T20:31:41.2817354Z 2025-05-07T20:31:41.2817518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2819307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
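Two failure modes alternate through this run: examples that reach the kernel launch die with the fp8e4nv CompilationError, while examples that fail during setup die with OutOfMemoryError. A hypothetical autouse fixture (not present in the FBGEMM tree) that prints allocator statistics after each test would make interleaved logs like this easier to attribute; note it runs once per test function, not once per Hypothesis example:

import pytest
import torch


@pytest.fixture(autouse=True)
def cuda_memory_report():
    yield
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[mem] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")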
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2819444Z 2025-05-07T20:31:41.2819562Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2819567Z 2025-05-07T20:31:41.2819667Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2819886Z self=, 2025-05-07T20:31:41.2819962Z T=128, 2025-05-07T20:31:41.2820036Z D=7168, 2025-05-07T20:31:41.2820120Z scale_ub=1200.0, 2025-05-07T20:31:41.2820201Z contiguous=True, 2025-05-07T20:31:41.2820280Z compiled=True, 2025-05-07T20:31:41.2820351Z ) 2025-05-07T20:31:41.2820568Z self = 2025-05-07T20:31:41.2820737Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2820741Z 2025-05-07T20:31:41.2820818Z @given( 2025-05-07T20:31:41.2820932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2821027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2821138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2821249Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2821362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2821436Z ) 2025-05-07T20:31:41.2821677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2821768Z def test_silu_mul_quant( 2025-05-07T20:31:41.2821840Z self, 2025-05-07T20:31:41.2821911Z T: int, 2025-05-07T20:31:41.2821984Z D: int, 2025-05-07T20:31:41.2822078Z scale_ub: Optional[float], 2025-05-07T20:31:41.2822322Z contiguous: bool, 2025-05-07T20:31:41.2822407Z compiled: bool, 2025-05-07T20:31:41.2822479Z ) -> None: 2025-05-07T20:31:41.2822571Z torch.manual_seed(2025) 2025-05-07T20:31:41.2822643Z 2025-05-07T20:31:41.2822806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2822878Z 2025-05-07T20:31:41.2822963Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2823081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2823170Z x = x_sign * x_clamp 2025-05-07T20:31:41.2823248Z x0 = x[:, :D] 2025-05-07T20:31:41.2823322Z x1 = x[:, D:] 2025-05-07T20:31:41.2823392Z 2025-05-07T20:31:41.2823472Z if contiguous: 2025-05-07T20:31:41.2823559Z x0 = x0.contiguous() 2025-05-07T20:31:41.2823646Z x1 = x1.contiguous() 2025-05-07T20:31:41.2823715Z 2025-05-07T20:31:41.2823801Z if scale_ub is not None: 2025-05-07T20:31:41.2823911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2824046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2824123Z ) 2025-05-07T20:31:41.2824195Z else: 2025-05-07T20:31:41.2824288Z scale_ub_tensor = None 2025-05-07T20:31:41.2824361Z 2025-05-07T20:31:41.2824485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2824571Z op = silu_mul_quant 2025-05-07T20:31:41.2824659Z if compiled: 2025-05-07T20:31:41.2824753Z op = torch.compile(op) 2025-05-07T20:31:41.2824852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2824920Z 2025-05-07T20:31:41.2825009Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2825013Z 2025-05-07T20:31:41.2825105Z moe/activation_test.py:117: 2025-05-07T20:31:41.2825230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2825325Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2825513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2825883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2825974Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2826471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2826563Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2826924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2827148Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2827487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2827580Z kernel = self.compile( 2025-05-07T20:31:41.2827968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2828145Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2828271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2828275Z 2025-05-07T20:31:41.2828478Z self = 2025-05-07T20:31:41.2829254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2829747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150860>} 2025-05-07T20:31:41.2830568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2830765Z context = 2025-05-07T20:31:41.2830769Z 2025-05-07T20:31:41.2830929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2831191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2831293Z module_map=module_map) 2025-05-07T20:31:41.2831448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2831544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2831617Z E ^ 2025-05-07T20:31:41.2831971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2831976Z 2025-05-07T20:31:41.2832393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2832403Z 2025-05-07T20:31:41.2832502Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2832724Z self=, 2025-05-07T20:31:41.2832799Z T=128, 2025-05-07T20:31:41.2832873Z D=7168, 2025-05-07T20:31:41.2832953Z scale_ub=1200.0, 2025-05-07T20:31:41.2833034Z contiguous=True, 2025-05-07T20:31:41.2833114Z compiled=False, 2025-05-07T20:31:41.2833184Z ) 2025-05-07T20:31:41.2833396Z self = 2025-05-07T20:31:41.2833564Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2833569Z 2025-05-07T20:31:41.2833643Z @given( 2025-05-07T20:31:41.2833756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2833853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2833961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2834182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2834295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2834366Z ) 2025-05-07T20:31:41.2834608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2834703Z def test_silu_mul_quant( 2025-05-07T20:31:41.2834773Z self, 2025-05-07T20:31:41.2834845Z T: int, 2025-05-07T20:31:41.2834926Z D: int, 2025-05-07T20:31:41.2835019Z scale_ub: Optional[float], 2025-05-07T20:31:41.2835103Z contiguous: bool, 2025-05-07T20:31:41.2835190Z compiled: bool, 2025-05-07T20:31:41.2835261Z ) -> None: 2025-05-07T20:31:41.2835352Z torch.manual_seed(2025) 2025-05-07T20:31:41.2835422Z 2025-05-07T20:31:41.2835585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2835656Z 2025-05-07T20:31:41.2835740Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2835863Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2837649Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2837655Z 2025-05-07T20:31:41.2837772Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:41.2837776Z 2025-05-07T20:31:41.2837880Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2838098Z self=, 2025-05-07T20:31:41.2838169Z T=128, 2025-05-07T20:31:41.2838324Z D=5120, 2025-05-07T20:31:41.2838709Z scale_ub=1200.0, 2025-05-07T20:31:41.2838842Z contiguous=True, 2025-05-07T20:31:41.2838921Z compiled=True, 2025-05-07T20:31:41.2838989Z ) 2025-05-07T20:31:41.2839207Z self = 2025-05-07T20:31:41.2839372Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2839377Z 2025-05-07T20:31:41.2839449Z @given( 2025-05-07T20:31:41.2839568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2839663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2839770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2839885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2839994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2840066Z ) 2025-05-07T20:31:41.2840307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2840407Z def test_silu_mul_quant( 2025-05-07T20:31:41.2840486Z self, 2025-05-07T20:31:41.2840558Z T: int, 2025-05-07T20:31:41.2840627Z D: int, 2025-05-07T20:31:41.2840724Z scale_ub: Optional[float], 2025-05-07T20:31:41.2840809Z contiguous: bool, 2025-05-07T20:31:41.2840888Z compiled: bool, 2025-05-07T20:31:41.2840962Z ) -> None: 2025-05-07T20:31:41.2841052Z torch.manual_seed(2025) 2025-05-07T20:31:41.2841122Z 2025-05-07T20:31:41.2841288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2841358Z 2025-05-07T20:31:41.2841446Z > x_sign = torch.sign(x) 2025-05-07T20:31:41.2843210Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2843489Z 2025-05-07T20:31:41.2843613Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:41.2843618Z 2025-05-07T20:31:41.2843714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2843933Z self=, 2025-05-07T20:31:41.2844010Z T=128, 2025-05-07T20:31:41.2844084Z D=7168, 2025-05-07T20:31:41.2844164Z scale_ub=None, 2025-05-07T20:31:41.2844250Z contiguous=True, 2025-05-07T20:31:41.2844329Z compiled=True, 2025-05-07T20:31:41.2844400Z ) 2025-05-07T20:31:41.2844617Z self = 2025-05-07T20:31:41.2844783Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2844793Z 2025-05-07T20:31:41.2844870Z @given( 2025-05-07T20:31:41.2844986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2845078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2845190Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2845300Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2845408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2845481Z ) 2025-05-07T20:31:41.2845720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2845808Z def test_silu_mul_quant( 2025-05-07T20:31:41.2845884Z self, 2025-05-07T20:31:41.2845958Z T: int, 2025-05-07T20:31:41.2846032Z D: int, 2025-05-07T20:31:41.2846123Z scale_ub: Optional[float], 2025-05-07T20:31:41.2846206Z contiguous: bool, 2025-05-07T20:31:41.2846293Z compiled: bool, 2025-05-07T20:31:41.2846487Z ) -> None: 2025-05-07T20:31:41.2846582Z torch.manual_seed(2025) 2025-05-07T20:31:41.2846653Z 2025-05-07T20:31:41.2846817Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2848579Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2848585Z 2025-05-07T20:31:41.2848698Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2848833Z =============================== warnings summary =============================== 2025-05-07T20:31:41.2849147Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:41.2849448Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:41.2849747Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:41.2850619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:41.2850846Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:41.2850853Z 2025-05-07T20:31:41.2851024Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:41.2852291Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:41.2852555Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:41.2852560Z 2025-05-07T20:31:41.2852768Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:41.2852933Z ================== 1 failed, 1 passed, 13 warnings in 22.42s =================== 2025-05-07T20:31:43.1208875Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:43.1874656Z 2025-05-07T20:31:43.1875664Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:43.1876072Z 2025-05-07T20:31:43.1876077Z 2025-05-07T20:31:43.1894013Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:45.3457651Z ============================= test session starts ============================== 2025-05-07T20:31:45.3458355Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:45.3459377Z cachedir: .pytest_cache 2025-05-07T20:31:45.3460516Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:45.3461943Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:45.3462741Z plugins: hypothesis-6.131.14 2025-05-07T20:31:46.9647171Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:47.1167136Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:47.1167540Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:47.1167761Z 2025-05-07T20:31:49.3049781Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.3050904Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:49.3052264Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.3053763Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.3054746Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3056056Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.3057447Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.3058428Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3059665Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.3061401Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.3062467Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3063743Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:49.3065004Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:49.3066230Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.3067431Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:49.3068258Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3069281Z W0507 
20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:49.3070300Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:49.3071251Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:49.3072475Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.3073762Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.3074888Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:49.3075945Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:49.3077120Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.3078482Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.3079539Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.3080502Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.3081245Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:49.3082263Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.3226065Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.3227135Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:49.3228473Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.3229946Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.3230929Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3232246Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.3233629Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.3234607Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3236017Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.3237413Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.3238702Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3240033Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:49.3241287Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:49.3242513Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.3243887Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:49.3244706Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3245729Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:49.3246748Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:49.3247543Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:49.3248754Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.3250185Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.3251302Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:49.3252351Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:49.3253537Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.3254890Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.3255946Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.3256858Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.3257596Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:49.3258610Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:49.8645329Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:49.8646057Z self=,
2025-05-07T20:31:49.8646462Z T=1,
2025-05-07T20:31:49.8646652Z D=5120,
2025-05-07T20:31:49.8646846Z scale_ub=None,
2025-05-07T20:31:49.8647052Z contiguous=True,
2025-05-07T20:31:49.8647283Z compiled=True,
2025-05-07T20:31:49.8647492Z )
2025-05-07T20:31:49.8647814Z self = 
2025-05-07T20:31:49.8648300Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:49.8648560Z 
2025-05-07T20:31:49.8648649Z     @given(
2025-05-07T20:31:49.8648882Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:49.8649195Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:49.8649502Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:49.8649831Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:49.8650160Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:49.8650456Z     )
2025-05-07T20:31:49.8650810Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:49.8651246Z     def test_silu_mul_quant(
2025-05-07T20:31:49.8651494Z         self,
2025-05-07T20:31:49.8651694Z         T: int,
2025-05-07T20:31:49.8651890Z         D: int,
2025-05-07T20:31:49.8652117Z         scale_ub: Optional[float],
2025-05-07T20:31:49.8652399Z         contiguous: bool,
2025-05-07T20:31:49.8652634Z         compiled: bool,
2025-05-07T20:31:49.8652864Z     ) -> None:
2025-05-07T20:31:49.8653088Z         torch.manual_seed(2025)
2025-05-07T20:31:49.8653333Z 
2025-05-07T20:31:49.8653605Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:49.8655437Z 
2025-05-07T20:31:49.8655638Z         x_sign = torch.sign(x)
2025-05-07T20:31:49.8655926Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:49.8656248Z         x = x_sign * x_clamp
2025-05-07T20:31:49.8657164Z         x0 = x[:, :D]
2025-05-07T20:31:49.8657402Z         x1 = x[:, D:]
2025-05-07T20:31:49.8657608Z 
2025-05-07T20:31:49.8657800Z         if contiguous:
2025-05-07T20:31:49.8658036Z             x0 = x0.contiguous()
2025-05-07T20:31:49.8658290Z             x1 = x1.contiguous()
2025-05-07T20:31:49.8658534Z 
2025-05-07T20:31:49.8658726Z         if scale_ub is not None:
2025-05-07T20:31:49.8658997Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:49.8659334Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:49.8670088Z             )
2025-05-07T20:31:49.8670324Z         else:
2025-05-07T20:31:49.8670548Z             scale_ub_tensor = None
2025-05-07T20:31:49.8670813Z 
2025-05-07T20:31:49.8671051Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:49.8671379Z             op = silu_mul_quant
2025-05-07T20:31:49.8671639Z             if compiled:
2025-05-07T20:31:49.8671894Z                 op = torch.compile(op)
2025-05-07T20:31:49.8672209Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:49.8672495Z 
2025-05-07T20:31:49.8672687Z         y_fp8, y_scale = fn()
2025-05-07T20:31:49.8672984Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:49.8673283Z 
2025-05-07T20:31:49.8673531Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:49.8673865Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:49.8674167Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:49.8674488Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:49.8674843Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.8675164Z 
2025-05-07T20:31:49.8675372Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:49.8675569Z 
2025-05-07T20:31:49.8675672Z moe/activation_test.py:126: 
2025-05-07T20:31:49.8675984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:49.8676452Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:49.8676795Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.8677591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:49.8678359Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:49.8678919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:49.8679609Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.8680321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:49.8681058Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.8681836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:31:49.8682600Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.8683523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:49.8685627Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:49.8686248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:49.8686771Z     fn()
2025-05-07T20:31:49.8687296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:49.8687886Z     self.fn.run(
2025-05-07T20:31:49.8688358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.8688898Z     kernel = self.compile(
2025-05-07T20:31:49.8689457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.8690211Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.8690609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:49.8690847Z 
2025-05-07T20:31:49.8691057Z self = 
2025-05-07T20:31:49.8692147Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.8693535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd39c2c7ce0>}
2025-05-07T20:31:49.8694881Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.8695919Z context = 
2025-05-07T20:31:49.8696212Z 
2025-05-07T20:31:49.8696382Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.8696911Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.8697376Z                            module_map=module_map)
2025-05-07T20:31:49.8697751Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.8698118Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:49.8698392Z E       ^
2025-05-07T20:31:49.8698857Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:49.8699324Z 
2025-05-07T20:31:49.8699825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.8700398Z 
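Note the failure path: with compiled=True the test gets past fn() (torch.compile only logs the W-prefixed warnings while tracing) and dies inside ref_fn, because triton_quantize_fp8_row's autotuner benchmarks each surviving config (autotuner.py:186 -> _bench -> do_bench) and the first benchmark call JIT-compiles _kernel_quantize_fp8_row, which is where make_ir rejects fp8e4nv. Only the elementwise part of the reference is portable; as a plain-PyTorch sketch (no fp8, so it runs on any CUDA GPU), the math being checked is:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Same math as ref_fn above, minus the fp8 row quantization:
        # y = silu(x0) * x1 = x0 * sigmoid(x0) * x1, computed in float32.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32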
2025-05-07T20:31:49.8700514Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:49.8700929Z self=,
2025-05-07T20:31:49.8701341Z T=2048,
2025-05-07T20:31:49.8701539Z D=5120,
2025-05-07T20:31:49.8701742Z scale_ub=1200.0,
2025-05-07T20:31:49.8701965Z contiguous=True,
2025-05-07T20:31:49.8702203Z compiled=False,
2025-05-07T20:31:49.8702418Z )
2025-05-07T20:31:50.4162816Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:50.4196275Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.4197185Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:50.4198018Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^
2025-05-07T20:31:50.4199040Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.4184163Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:50.4184985Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.4186160Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:50.4187193Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:50.4187990Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:50.4189206Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.4190490Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.4191619Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:50.4192673Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:50.4193855Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.4195217Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.4196275Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.4197185Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.4198018Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:50.4199040Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.5216773Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:50.5217835Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:50.5219181Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:50.5220620Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:50.5221590Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5222896Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:50.5224281Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.5225421Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5226668Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:50.5228043Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.5229101Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5230388Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:50.5231645Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:50.5232867Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.5234068Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:50.5234899Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5235924Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:50.5236953Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:50.5237904Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:50.5239265Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.5240562Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.5241686Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:50.5242735Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:50.5244024Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.5245392Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.5246453Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.5247370Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.5248115Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:50.5249251Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:50.9724825Z self = 
2025-05-07T20:31:50.9725386Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:50.9725666Z 
2025-05-07T20:31:50.9737372Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:50.9737552Z 
2025-05-07T20:31:50.9737656Z moe/activation_test.py:117: 
2025-05-07T20:31:50.9737961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.9738289Z moe/activation_test.py:115: in fn
2025-05-07T20:31:50.9738720Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:50.9739418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:50.9740112Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:50.9740710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:50.9741416Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:50.9742085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:50.9742626Z     kernel = self.compile(
2025-05-07T20:31:50.9743306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:50.9743976Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:50.9744375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.9744616Z 
2025-05-07T20:31:50.9744829Z self = 
2025-05-07T20:31:50.9745917Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:50.9749975Z 
2025-05-07T20:31:50.9750167Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:50.9750700Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.9751177Z                            module_map=module_map)
2025-05-07T20:31:50.9751543Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.9751903Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:50.9752167Z E       ^
2025-05-07T20:31:50.9752634Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:50.9753099Z 
2025-05-07T20:31:50.9753525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
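With compiled=False the failure is in silu_mul_quant itself: activation.py:80 launches _fbgemm_silu_mul_quant[grid](...), and a @triton.jit kernel is compiled on its first launch (jit.py run -> self.compile -> make_ir), which is where the same ValueError surfaces. A toy sketch of that launch pattern (a made-up kernel with no fp8 in it, so it also compiles on SM 8.6):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2, mask=mask)

    def double(x: torch.Tensor) -> torch.Tensor:
        # Same kernel[grid](...) launch shape as _fbgemm_silu_mul_quant[grid](...);
        # compilation happens lazily here, on the first call (CUDA tensors required).
        y = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        _double_kernel[grid](x, y, n, BLOCK=1024)
        return y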
2025-05-07T20:31:50.9754278Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:50.9754694Z self=,
2025-05-07T20:31:50.9755094Z T=2048,
2025-05-07T20:31:50.9755291Z D=5120,
2025-05-07T20:31:50.9755491Z scale_ub=1200.0,
2025-05-07T20:31:50.9755711Z contiguous=True,
2025-05-07T20:31:50.9755936Z compiled=True,
2025-05-07T20:31:50.9756151Z )
2025-05-07T20:31:50.9756468Z self = 
2025-05-07T20:31:50.9756971Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:50.9757250Z 
2025-05-07T20:31:50.9771776Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:50.9771972Z 
2025-05-07T20:31:50.9772177Z moe/activation_test.py:126: 
2025-05-07T20:31:50.9772490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.9772842Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:50.9773171Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:50.9773978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:50.9774749Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:50.9791088Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:50.9791616Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.9792083Z                            module_map=module_map)
2025-05-07T20:31:50.9792455Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.9792819Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:50.9793082Z E       ^
2025-05-07T20:31:50.9793555Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:50.9794120Z 
2025-05-07T20:31:50.9794541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:50.9795057Z 
2025-05-07T20:31:50.9795170Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:50.9795581Z self=,
2025-05-07T20:31:50.9795991Z T=16384,
2025-05-07T20:31:50.9796184Z D=7168,
2025-05-07T20:31:50.9796370Z scale_ub=1200.0,
2025-05-07T20:31:50.9796597Z contiguous=False,
2025-05-07T20:31:50.9796829Z compiled=False,
2025-05-07T20:31:50.9797039Z )
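The repeated "Trying example" blocks are Hypothesis at work: with Verbosity.verbose each drawn example is printed before it runs, and st.sampled_from draws combinations from the fixed (T, D, scale_ub, contiguous, compiled) grid until max_examples is exhausted, so one underlying bug gets reported once per sampled combination. A stripped-down sketch of the same pattern (toy test, not the FBGEMM one):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # Verbose mode prints "Trying example: test_shapes(T=..., D=...)"
        # for each drawn combination, like the blocks in this log.
        assert T * D > 0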
2025-05-07T20:31:51.2944713Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:51.2977104Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:51.2978030Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:51.2978773Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:31:51.2979799Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:51.3703800Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:51.3736647Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:51.3737567Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:51.3738560Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:31:51.3739584Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0655855Z self = 
2025-05-07T20:31:52.0656466Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:52.0656751Z 
2025-05-07T20:31:52.0668521Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:52.0668694Z 
2025-05-07T20:31:52.0668794Z moe/activation_test.py:117: 
2025-05-07T20:31:52.0669093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:52.0669422Z moe/activation_test.py:115: in fn
2025-05-07T20:31:52.0669713Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:52.0670413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:52.0671112Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:52.0681027Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:52.0681548Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:52.0682021Z                            module_map=module_map)
2025-05-07T20:31:52.0682379Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:52.0682739Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:52.0682999Z E       ^
2025-05-07T20:31:52.0683624Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0684089Z 
2025-05-07T20:31:52.0684514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
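The contiguous flag matters because x0 = x[:, :D] and x1 = x[:, D:] are column slices of one [T, 2 * D] buffer: they share its storage and keep a row stride of 2 * D, so with contiguous=False the kernels receive strided views. For example:

    import torch

    x = torch.randn(4, 16)        # stand-in for the [T, 2 * D] activation buffer
    x0, x1 = x[:, :8], x[:, 8:]   # column slices: same storage, row stride 16
    assert not x0.is_contiguous() and not x1.is_contiguous()
    assert x1.contiguous().is_contiguous()  # the contiguous=True branch copies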
2025-05-07T20:31:52.0685137Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:52.0685547Z self=,
2025-05-07T20:31:52.0685946Z T=1,
2025-05-07T20:31:52.0686130Z D=7168,
2025-05-07T20:31:52.0686324Z scale_ub=None,
2025-05-07T20:31:52.0686536Z contiguous=True,
2025-05-07T20:31:52.0686760Z compiled=True,
2025-05-07T20:31:52.0686968Z )
2025-05-07T20:31:52.0687284Z self = 
2025-05-07T20:31:52.0687773Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:52.0688035Z 
2025-05-07T20:31:52.0702065Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:52.0702257Z 
2025-05-07T20:31:52.0702364Z moe/activation_test.py:126: 
2025-05-07T20:31:52.0702652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:52.0702985Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:52.0703313Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:52.0704099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:52.0704857Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:52.0720978Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:52.0721500Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:52.0721965Z                            module_map=module_map)
2025-05-07T20:31:52.0722322Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:52.0722760Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:52.0723032Z E       ^
2025-05-07T20:31:52.0723574Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0724032Z 
2025-05-07T20:31:52.0724451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:52.0724971Z 
2025-05-07T20:31:52.0725072Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:52.0725482Z self=,
2025-05-07T20:31:52.0725880Z T=4096,
2025-05-07T20:31:52.0726066Z D=5120,
2025-05-07T20:31:52.0726257Z scale_ub=None,
2025-05-07T20:31:52.0726466Z contiguous=False,
2025-05-07T20:31:52.0726694Z compiled=False,
2025-05-07T20:31:52.0726902Z )
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.4433485Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.4434723Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.4436121Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.4437206Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.4438711Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.4439980Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:31:52.4441202Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.4442596Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:31:52.4443537Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.4444572Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:52.4445605Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:31:52.4446404Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.4447632Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.4448932Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.4450068Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:52.4451172Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:31:52.4452368Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.4453746Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.4454953Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.4455876Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.4456623Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:31:52.4457663Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.7113204Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.7114335Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:31:52.7115698Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.7117173Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.7118159Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7119923Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.7121467Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.7122451Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7123800Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.7125184Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.7126255Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7127546Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.7128802Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:31:52.7130021Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.7131284Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:31:52.7132126Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7133335Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:52.7134351Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:31:52.7135149Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.7136366Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.7137660Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.7139059Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:52.7140103Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:31:52.7141294Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.7142660Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.7143856Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.7144776Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.7145519Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:31:52.7146550Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
self = ..., T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = ..., T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
(test listing identical to the one above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: ... silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
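For context on what the failing kernels compute: triton_quantize_fp8_row returns a rowwise-quantized FP8 tensor plus one float32 scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode sketch of that contract (the scale convention below is an assumption inferred from the test's dequantization step, not FBGEMM's exact implementation):

    import torch

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # One scale per row, chosen so that y / scale fits into float8_e4m3fn.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Note that an eager version like this should still run on the machine in this log: PyTorch's float8_e4m3fn casts are not architecture-gated, only Triton's lowering of fp8e4nv is.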
Trying example: test_silu_mul_quant(
    self=...,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = ..., T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
(identical listing; fn() completes and the failure is at the reference path this time)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126: ... triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(
    self=...,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = ..., T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
(identical listing)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: ... silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
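The repeated "Trying example: test_silu_mul_quant(...)" blocks are Hypothesis output: the test runs with verbosity=Verbosity.verbose, which echoes each drawn example before executing it, and every drawn example here fails the same way. A minimal, self-contained sketch of that reporting style (a hypothetical toy test, not from this suite):

    from hypothesis import Verbosity, given, settings, strategies as st

    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    @given(t=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_demo(t: int) -> None:
        # Each drawn value is echoed as "Trying example: test_demo(t=...)".
        assert t >= 1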
Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = ..., T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
(identical listing)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: ... silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(
    self=...,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

W0507 20:31:54.114000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752 [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:54.114000 87308 [0/4] (same traceback as [0/3] above, ending in CompilationError: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"))
W0507 20:31:54.200000 87308 [0/4] (identical warning and traceback emitted a second time)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.5040916Z self = 2025-05-07T20:31:54.5041673Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:54.5042041Z 2025-05-07T20:31:54.5042171Z @given( 2025-05-07T20:31:54.5042792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.5043123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.5043569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.5043908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.5044254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.5044555Z ) 2025-05-07T20:31:54.5044909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.5045374Z def test_silu_mul_quant( 2025-05-07T20:31:54.5045634Z self, 2025-05-07T20:31:54.5045838Z T: int, 2025-05-07T20:31:54.5046052Z D: int, 2025-05-07T20:31:54.5046291Z scale_ub: Optional[float], 2025-05-07T20:31:54.5046570Z contiguous: bool, 2025-05-07T20:31:54.5046825Z compiled: bool, 2025-05-07T20:31:54.5047072Z ) -> None: 2025-05-07T20:31:54.5047305Z torch.manual_seed(2025) 2025-05-07T20:31:54.5047560Z 2025-05-07T20:31:54.5047857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.5048218Z 2025-05-07T20:31:54.5048423Z x_sign = torch.sign(x) 2025-05-07T20:31:54.5048732Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.5049061Z x = x_sign * x_clamp 2025-05-07T20:31:54.5049306Z x0 = x[:, :D] 2025-05-07T20:31:54.5049538Z x1 = x[:, D:] 2025-05-07T20:31:54.5049766Z 2025-05-07T20:31:54.5049957Z if contiguous: 2025-05-07T20:31:54.5050206Z x0 = x0.contiguous() 2025-05-07T20:31:54.5050484Z x1 = x1.contiguous() 2025-05-07T20:31:54.5050728Z 2025-05-07T20:31:54.5050934Z if scale_ub is not None: 2025-05-07T20:31:54.5051222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.5051560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.5051886Z ) 2025-05-07T20:31:54.5052094Z else: 2025-05-07T20:31:54.5052485Z scale_ub_tensor = None 2025-05-07T20:31:54.5052755Z 2025-05-07T20:31:54.5053013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.5053354Z op = silu_mul_quant 2025-05-07T20:31:54.5053613Z if compiled: 2025-05-07T20:31:54.5053884Z op = torch.compile(op) 2025-05-07T20:31:54.5054201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.5054488Z 2025-05-07T20:31:54.5054702Z y_fp8, y_scale = fn() 2025-05-07T20:31:54.5055007Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:54.5055305Z 2025-05-07T20:31:54.5055555Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.5055907Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:54.5056207Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:54.5056536Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:54.5056916Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.5057239Z 2025-05-07T20:31:54.5057443Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:54.5057651Z 2025-05-07T20:31:54.5057759Z moe/activation_test.py:126: 2025-05-07T20:31:54.5058068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.5058409Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:54.5058752Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.5059560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:54.5060333Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:54.5060886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.5061639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.5062354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:54.5063189Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.5063964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:54.5064733Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.5065485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:54.5066147Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:54.5066775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:54.5067313Z fn() 2025-05-07T20:31:54.5067849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:54.5068451Z self.fn.run( 2025-05-07T20:31:54.5068925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.5069476Z kernel = self.compile( 2025-05-07T20:31:54.5070032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.5070695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.5071112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.5071347Z 2025-05-07T20:31:54.5071566Z self = 2025-05-07T20:31:54.5072736Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.5074131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3969c3560>} 2025-05-07T20:31:54.5075505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.5076549Z context = 2025-05-07T20:31:54.5076839Z 2025-05-07T20:31:54.5077017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.5077544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.5078024Z module_map=module_map) 2025-05-07T20:31:54.5078400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.5078780Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:54.5079053Z E ^ 2025-05-07T20:31:54.5079531Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:54.5079989Z
2025-05-07T20:31:54.5080422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:54.5080943Z
2025-05-07T20:31:54.5081057Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:54.5081475Z     self=,
2025-05-07T20:31:54.5081933Z     T=2048,
2025-05-07T20:31:54.5082135Z     D=5120,
2025-05-07T20:31:54.5082329Z     scale_ub=None,
2025-05-07T20:31:54.5082558Z     contiguous=True,
2025-05-07T20:31:54.5082789Z     compiled=True,
2025-05-07T20:31:54.5082999Z )
2025-05-07T20:31:54.8300455Z W0507 20:31:54.827000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... full traceback through identify_mutated_tensors -> generate_ttir -> make_ir -> ast_to_ttir, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") for the compiled kernel _fbgemm_silu_mul_quant; the identical warning and traceback were emitted twice ...]
[... test_silu_mul_quant body and CompilationError traceback identical to the T=1 case above, repeated for T=2048 ...]
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... the identical pair of identify_mutated_tensors warnings ([0/6]) emitted again ...]
[... the identical failure repeated for T=128 ...]
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... the identical pair of identify_mutated_tensors warnings ([0/7]) and the identical failure repeated for T=4096 ...]
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:57.0072214Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:57.0073463Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:57.0074795Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:57.0075789Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:57.0077040Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[... the identical test body repeated for T=16384; the traceback concludes below ...]
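The recompile_limit warning just above is a separate issue from the FP8 failure: x0 = x[:, :D] is a strided view whose row stride is 2*D (10240), while x0.contiguous() has row stride D (5120), so the contiguous and non-contiguous examples keep failing each other's stride guards until torch._dynamo gives up after 8 recompiles and falls back to eager. A small illustration of the two stride patterns; the mitigations in the comments are generic options, not changes made in this run:

import torch

T, D = 4, 5120
x = torch.randn(T, 2 * D)         # a CPU tensor is enough to inspect strides
x0_view = x[:, :D]                # slice shares storage with x
x0_contig = x0_view.contiguous()  # fresh, densely packed copy

print(x0_view.stride())    # (10240, 1) -> the "actual" in the guard failure
print(x0_contig.stride())  # (5120, 1)  -> the "expected" in the guard failure

# Generic mitigations (assumptions, not applied here):
#   torch.compile(op, dynamic=True)             # one graph over shapes/strides
#   torch._dynamo.config.recompile_limit = 64   # or raise the recompile budget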
2025-05-07T20:31:57.0760638Z self = <...>
2025-05-07T20:31:57.0761954Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

2025-05-07T20:31:57.0786207Z moe/activation_test.py:126:
2025-05-07T20:31:57.0786721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.0787318Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:57.0787896Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.0789605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:57.0790977Z     _kernel_quantize_fp8_row[grid](
[identical Triton autotune/compile frames and make_ir failure as above]
2025-05-07T20:31:57.0822460Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.0823093Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:57.0823556Z E       ^
2025-05-07T20:31:57.0824377Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.0825928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
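Every failure in this job has the same root cause: Triton's fp8e4nv element type (the float8_e4m3fn layout) is only lowered on NVIDIA GPUs with compute capability 8.9 or newer, and the g5 runner's A10G is sm_86, where Triton offers only fp8e5 and fp8e4b15, exactly as the ValueError reports. A hedged sketch of a capability gate that would skip these examples at collection time; the helper names here are illustrative, not FBGEMM's API:

    # Hypothetical capability gate (names are illustrative, not FBGEMM's).
    import pytest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv lowering requires sm_89+ per the Triton error in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not cuda_supports_fp8e4nv(),
        reason="Triton fp8e4nv needs compute capability >= 8.9 (Ada/Hopper)",
    )

Applied as @requires_fp8 on test_silu_mul_quant, the whole Hypothesis matrix would skip on this runner instead of failing example by example.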
2025-05-07T20:31:57.0827241Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:57.1909606Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[same test body as above; this example fails inside fn() itself]
2025-05-07T20:31:57.1941290Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.1941763Z moe/activation_test.py:117:
2025-05-07T20:31:57.1942280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1942854Z moe/activation_test.py:115: in fn
2025-05-07T20:31:57.1943345Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1944355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:57.1945362Z     return fn(*args, **kwargs)
2025-05-07T20:31:57.1946477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1947603Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1948545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:57.1950059Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.1951262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.1952226Z     kernel = self.compile(
2025-05-07T20:31:57.1953205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.1954383Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.1955091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1965562Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.1966485Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.1967325Z                            module_map=module_map)
2025-05-07T20:31:57.1968192Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1968823Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1969263Z E       ^
2025-05-07T20:31:57.1970094Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1971686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1972808Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.4278518Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[same test body; here fn() returns and the reference path fails instead]
2025-05-07T20:31:57.4294001Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:57.4294304Z moe/activation_test.py:126:
2025-05-07T20:31:57.4294953Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:57.4295280Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.4296079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:57.4296846Z     _kernel_quantize_fp8_row[grid](
[identical Triton autotune/compile frames as above]
2025-05-07T20:31:57.4314771Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.4315149Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:57.4315414Z E       ^
2025-05-07T20:31:57.4315883Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.4316762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
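For reference, ref_fn above builds the fp32 SiLU-mul product and then calls triton_quantize_fp8_row, which picks one scale per row so that the row maximum (optionally clamped to scale_ub) maps onto the FP8 range; the test then dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A minimal eager sketch of that row-wise scheme; this is an illustration of the idea, not FBGEMM's kernel, and it assumes this PyTorch build exposes torch.float8_e4m3fn:

    # Eager sketch of row-wise FP8 quantization (illustrative, not
    # _kernel_quantize_fp8_row). FP8_MAX is 448.0 for float8_e4m3fn.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).float()        # per-row maximum
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)    # guard all-zero rows
        y_scale = row_max / FP8_MAX                  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale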
[the next seven examples run the same test body and fail in fn() at moe/activation_test.py:117 with the identical _fbgemm_silu_mul_quant CompilationError]
2025-05-07T20:31:57.4317389Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:57.5531997Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.5569950Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:57.6487790Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:57.6518896Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:57.7909464Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:57.7942013Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
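Because Hypothesis prints each "Trying example" with concrete parameters, any single failure above can be replayed without the property-based driver. A sketch of a standalone reproducer for one of the fn()-side failures; the module path and call signature are taken from the tracebacks in this log, but treat them as an assumption rather than a documented entry point:

    # Standalone replay of one failing example (path per the tracebacks above).
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]  # the contiguous=False variant
    # On sm_86 this raises the fp8e4nv CompilationError seen throughout this job.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)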
y_scale_ref = ref_fn() 2025-05-07T20:31:58.1597418Z 2025-05-07T20:31:58.1597525Z moe/activation_test.py:126: 2025-05-07T20:31:58.1597830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1598169Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.1598508Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.1599307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.1600069Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.1600635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.1601417Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.1602120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.1602848Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1603748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.1604510Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1605255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.1605899Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.1606522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.1607053Z fn() 2025-05-07T20:31:58.1607568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.1608158Z self.fn.run( 2025-05-07T20:31:58.1608645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.1609188Z kernel = self.compile( 2025-05-07T20:31:58.1609736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.1610404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.1610823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1611057Z 2025-05-07T20:31:58.1611370Z self = 2025-05-07T20:31:58.1612460Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.1613859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd3968dede0>} 2025-05-07T20:31:58.1615211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.1616247Z context = 2025-05-07T20:31:58.1616537Z 2025-05-07T20:31:58.1616709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.1617252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.1617786Z module_map=module_map) 2025-05-07T20:31:58.1618164Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.1618529Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.1618809Z E ^ 2025-05-07T20:31:58.1619282Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.1619740Z 2025-05-07T20:31:58.1620164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.1620691Z 2025-05-07T20:31:58.1620801Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.1621227Z self=, 2025-05-07T20:31:58.1621642Z T=1, 2025-05-07T20:31:58.1621835Z D=5120, 2025-05-07T20:31:58.1622041Z scale_ub=1200.0, 2025-05-07T20:31:58.1622290Z contiguous=False, 2025-05-07T20:31:58.1622572Z compiled=True, 2025-05-07T20:31:58.1622815Z ) 2025-05-07T20:31:58.2832830Z self = 2025-05-07T20:31:58.2834268Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.2834831Z 2025-05-07T20:31:58.2835004Z @given( 2025-05-07T20:31:58.2835486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.2836121Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.2836836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.2837509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.2838168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.2839034Z ) 2025-05-07T20:31:58.2839751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.2840645Z def test_silu_mul_quant( 2025-05-07T20:31:58.2841168Z self, 2025-05-07T20:31:58.2841591Z T: int, 2025-05-07T20:31:58.2841985Z D: int, 2025-05-07T20:31:58.2842437Z scale_ub: Optional[float], 2025-05-07T20:31:58.2842830Z contiguous: bool, 2025-05-07T20:31:58.2843111Z compiled: bool, 2025-05-07T20:31:58.2843452Z ) -> None: 2025-05-07T20:31:58.2843682Z torch.manual_seed(2025) 2025-05-07T20:31:58.2843934Z 2025-05-07T20:31:58.2844214Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.2844572Z 2025-05-07T20:31:58.2844779Z x_sign = torch.sign(x) 2025-05-07T20:31:58.2845076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.2845396Z x = x_sign * x_clamp 2025-05-07T20:31:58.2845647Z x0 = x[:, :D] 2025-05-07T20:31:58.2845869Z x1 = x[:, D:] 2025-05-07T20:31:58.2846086Z 2025-05-07T20:31:58.2846283Z if contiguous: 2025-05-07T20:31:58.2846522Z x0 = x0.contiguous() 2025-05-07T20:31:58.2847099Z x1 = x1.contiguous() 2025-05-07T20:31:58.2847360Z 2025-05-07T20:31:58.2847559Z if scale_ub is not None: 2025-05-07T20:31:58.2847848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.2848194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.2848516Z ) 2025-05-07T20:31:58.2848713Z else: 2025-05-07T20:31:58.2848932Z scale_ub_tensor = None 2025-05-07T20:31:58.2849195Z 2025-05-07T20:31:58.2849431Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.2849756Z op = silu_mul_quant 2025-05-07T20:31:58.2850019Z if compiled: 
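Both kernels (_fbgemm_silu_mul_quant above and _kernel_quantize_fp8_row in the reference path) fail during Triton lowering, before any test logic runs: both request the fp8e4nv element type (Triton's name for torch.float8_e4m3fn), and on this runner's GPU, an A10G at compute capability 8.6, the Triton build used here supports only fp8e4b15 and fp8e5, exactly as the ValueError reports; fp8e4nv lowering generally needs an sm_89+ (Ada/Hopper) part. A minimal capability probe, assuming nothing beyond PyTorch (this code is not part of the test or the log):

    # Sketch (not from the log): check whether this GPU can lower Triton's
    # fp8e4nv element type. fp8e4nv corresponds to torch.float8_e4m3fn; the
    # A10G in a g5.4xlarge reports (8, 6), which triggers the ValueError above.
    import torch

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        print(f"sm_{major}{minor}: fp8e4nv supported = {(major, minor) >= (8, 9)}")
    else:
        print("no CUDA device visible")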
2025-05-07T20:31:58.2850275Z op = torch.compile(op) 2025-05-07T20:31:58.2850580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2850871Z 2025-05-07T20:31:58.2851067Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.2851243Z 2025-05-07T20:31:58.2851348Z moe/activation_test.py:117: 2025-05-07T20:31:58.2851755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2852096Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.2852388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2852969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.2853548Z return fn(*args, **kwargs) 2025-05-07T20:31:58.2854218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.2854927Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.2855485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.2856180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.2856882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.2857537Z kernel = self.compile( 2025-05-07T20:31:58.2858093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.2858780Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.2859191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2859426Z 2025-05-07T20:31:58.2859648Z self = 2025-05-07T20:31:58.2860757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.2862161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd396161d00>} 2025-05-07T20:31:58.2863588Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.2864641Z context = 2025-05-07T20:31:58.2864936Z 2025-05-07T20:31:58.2865120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.2865656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.2866144Z module_map=module_map) 2025-05-07T20:31:58.2866525Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.2866889Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.2867160Z E ^ 2025-05-07T20:31:58.2867776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.2868247Z 2025-05-07T20:31:58.2868684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.2869212Z 2025-05-07T20:31:58.2869322Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.2869749Z self=, 2025-05-07T20:31:58.2870164Z T=1, 2025-05-07T20:31:58.2870354Z D=5120, 2025-05-07T20:31:58.2870558Z scale_ub=1200.0, 2025-05-07T20:31:58.2870796Z contiguous=False, 2025-05-07T20:31:58.2871027Z compiled=False, 2025-05-07T20:31:58.2871243Z ) 2025-05-07T20:31:58.2871574Z self = 2025-05-07T20:31:58.2872087Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.2872366Z 2025-05-07T20:31:58.2872450Z @given( 2025-05-07T20:31:58.2872690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.2873077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.2873389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.2873734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.2874079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.2874374Z ) 2025-05-07T20:31:58.2874741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.2875206Z def test_silu_mul_quant( 2025-05-07T20:31:58.2875460Z self, 2025-05-07T20:31:58.2875658Z T: int, 2025-05-07T20:31:58.2875862Z D: int, 2025-05-07T20:31:58.2876090Z scale_ub: Optional[float], 2025-05-07T20:31:58.2876368Z contiguous: bool, 2025-05-07T20:31:58.2876619Z compiled: bool, 2025-05-07T20:31:58.2876857Z ) -> None: 2025-05-07T20:31:58.2877076Z torch.manual_seed(2025) 2025-05-07T20:31:58.2877329Z 2025-05-07T20:31:58.2877622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.2878029Z 2025-05-07T20:31:58.2878236Z x_sign = torch.sign(x) 2025-05-07T20:31:58.2878539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.2878856Z x = x_sign * x_clamp 2025-05-07T20:31:58.2879115Z x0 = x[:, :D] 2025-05-07T20:31:58.2879347Z x1 = x[:, D:] 2025-05-07T20:31:58.2879560Z 2025-05-07T20:31:58.2879761Z if contiguous: 2025-05-07T20:31:58.2880005Z x0 = x0.contiguous() 2025-05-07T20:31:58.2880267Z x1 = x1.contiguous() 2025-05-07T20:31:58.2880520Z 2025-05-07T20:31:58.2880722Z if scale_ub is not None: 2025-05-07T20:31:58.2881006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.2881355Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.2881679Z ) 2025-05-07T20:31:58.2881887Z else: 2025-05-07T20:31:58.2882102Z scale_ub_tensor = None 2025-05-07T20:31:58.2882372Z 2025-05-07T20:31:58.2882633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.2882953Z op = silu_mul_quant 2025-05-07T20:31:58.2883276Z if compiled: 2025-05-07T20:31:58.2883532Z op = torch.compile(op) 2025-05-07T20:31:58.2883833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2884117Z 2025-05-07T20:31:58.2884321Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.2884488Z 2025-05-07T20:31:58.2884593Z moe/activation_test.py:117: 2025-05-07T20:31:58.2884899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2885242Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.2885535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2886236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.2886942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.2887588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.2888287Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.2888975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.2889526Z kernel = self.compile( 2025-05-07T20:31:58.2890087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.2890760Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.2891170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2891403Z 2025-05-07T20:31:58.2891622Z self = 2025-05-07T20:31:58.2892730Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.2894159Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3006d8b80>} 2025-05-07T20:31:58.2895536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.2896583Z context = 2025-05-07T20:31:58.2896876Z 2025-05-07T20:31:58.2897054Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.2897591Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.2898088Z module_map=module_map) 2025-05-07T20:31:58.2898508Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.2898880Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.2899145Z E ^ 2025-05-07T20:31:58.2899628Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.2900089Z 2025-05-07T20:31:58.2900528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.2901057Z 2025-05-07T20:31:58.2901175Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.2901599Z self=, 2025-05-07T20:31:58.2902018Z T=16384, 2025-05-07T20:31:58.2902227Z D=5120, 2025-05-07T20:31:58.2902423Z scale_ub=1200.0, 2025-05-07T20:31:58.2902666Z contiguous=False, 2025-05-07T20:31:58.2902909Z compiled=True, 2025-05-07T20:31:58.2903120Z ) 2025-05-07T20:31:58.3601951Z self = 2025-05-07T20:31:58.3602708Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.3603005Z 2025-05-07T20:31:58.3603092Z @given( 2025-05-07T20:31:58.3603452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.3603770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.3604085Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.3604421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.3604756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.3605042Z ) 2025-05-07T20:31:58.3605396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.3605848Z def test_silu_mul_quant( 2025-05-07T20:31:58.3606093Z self, 2025-05-07T20:31:58.3606299Z T: int, 2025-05-07T20:31:58.3606509Z D: int, 2025-05-07T20:31:58.3606925Z scale_ub: Optional[float], 2025-05-07T20:31:58.3607220Z contiguous: bool, 2025-05-07T20:31:58.3607466Z compiled: bool, 2025-05-07T20:31:58.3607695Z ) -> None: 2025-05-07T20:31:58.3607926Z torch.manual_seed(2025) 2025-05-07T20:31:58.3608183Z 2025-05-07T20:31:58.3608463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.3608815Z 2025-05-07T20:31:58.3609025Z x_sign = torch.sign(x) 2025-05-07T20:31:58.3609318Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.3609639Z x = x_sign * x_clamp 2025-05-07T20:31:58.3609889Z x0 = x[:, :D] 2025-05-07T20:31:58.3610114Z x1 = x[:, D:] 2025-05-07T20:31:58.3610323Z 2025-05-07T20:31:58.3610524Z if contiguous: 2025-05-07T20:31:58.3610769Z x0 = x0.contiguous() 2025-05-07T20:31:58.3611031Z x1 = x1.contiguous() 2025-05-07T20:31:58.3611281Z 2025-05-07T20:31:58.3611485Z if scale_ub is not None: 2025-05-07T20:31:58.3611846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.3612198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.3612516Z ) 2025-05-07T20:31:58.3612712Z else: 2025-05-07T20:31:58.3612935Z scale_ub_tensor = None 2025-05-07T20:31:58.3613195Z 2025-05-07T20:31:58.3613432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3613764Z op = silu_mul_quant 2025-05-07T20:31:58.3614024Z if compiled: 2025-05-07T20:31:58.3614274Z op = torch.compile(op) 2025-05-07T20:31:58.3614579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3614863Z 2025-05-07T20:31:58.3615060Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.3615235Z 2025-05-07T20:31:58.3615338Z moe/activation_test.py:117: 2025-05-07T20:31:58.3615648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3615997Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.3616356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3616928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.3617504Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.3618170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.3618877Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.3619426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.3620119Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.3620788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.3621338Z kernel = self.compile( 2025-05-07T20:31:58.3621895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.3622569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.3622975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3623214Z 2025-05-07T20:31:58.3623431Z self = 2025-05-07T20:31:58.3624518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.3625892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3006d9e40>} 2025-05-07T20:31:58.3627334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.3628378Z context = 2025-05-07T20:31:58.3628677Z 2025-05-07T20:31:58.3628846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.3629383Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.3629867Z module_map=module_map) 2025-05-07T20:31:58.3630242Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.3630608Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.3630869Z E ^ 2025-05-07T20:31:58.3631342Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.3631815Z 2025-05-07T20:31:58.3632250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.3632824Z 2025-05-07T20:31:58.3632939Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.3633355Z self=, 2025-05-07T20:31:58.3633766Z T=2048, 2025-05-07T20:31:58.3633967Z D=7168, 2025-05-07T20:31:58.3634168Z scale_ub=1200.0, 2025-05-07T20:31:58.3634400Z contiguous=False, 2025-05-07T20:31:58.3634636Z compiled=True, 2025-05-07T20:31:58.3634857Z ) 2025-05-07T20:31:58.3635185Z self = 2025-05-07T20:31:58.3635691Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.3635970Z 2025-05-07T20:31:58.3636060Z @given( 2025-05-07T20:31:58.3636295Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.3636620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.3636943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.3637324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.3637666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.3637960Z ) 2025-05-07T20:31:58.3638312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.3638998Z def test_silu_mul_quant( 2025-05-07T20:31:58.3639513Z self, 2025-05-07T20:31:58.3639765Z T: int, 2025-05-07T20:31:58.3640041Z D: int, 2025-05-07T20:31:58.3640497Z scale_ub: Optional[float], 2025-05-07T20:31:58.3640829Z contiguous: bool, 2025-05-07T20:31:58.3641144Z compiled: bool, 2025-05-07T20:31:58.3652038Z ) -> None: 2025-05-07T20:31:58.3652408Z torch.manual_seed(2025) 2025-05-07T20:31:58.3652826Z 2025-05-07T20:31:58.3653158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.3653514Z 2025-05-07T20:31:58.3653719Z x_sign = torch.sign(x) 2025-05-07T20:31:58.3654029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.3654353Z x = x_sign * x_clamp 2025-05-07T20:31:58.3654604Z x0 = x[:, :D] 2025-05-07T20:31:58.3654821Z x1 = x[:, D:] 2025-05-07T20:31:58.3655033Z 2025-05-07T20:31:58.3655230Z if contiguous: 2025-05-07T20:31:58.3655464Z x0 = x0.contiguous() 2025-05-07T20:31:58.3655732Z x1 = x1.contiguous() 2025-05-07T20:31:58.3655985Z 2025-05-07T20:31:58.3656176Z if scale_ub is not None: 2025-05-07T20:31:58.3656459Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.3656801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.3657111Z ) 2025-05-07T20:31:58.3657315Z else: 2025-05-07T20:31:58.3657534Z scale_ub_tensor = None 2025-05-07T20:31:58.3657788Z 2025-05-07T20:31:58.3658032Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3658544Z op = silu_mul_quant 2025-05-07T20:31:58.3658814Z if compiled: 2025-05-07T20:31:58.3659065Z op = torch.compile(op) 2025-05-07T20:31:58.3659369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3659652Z 2025-05-07T20:31:58.3659853Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.3660032Z 2025-05-07T20:31:58.3660137Z moe/activation_test.py:117: 2025-05-07T20:31:58.3660443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3660783Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.3661075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3661652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.3662229Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.3662891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.3663688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.3664241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.3664930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.3665605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.3666151Z kernel = self.compile( 2025-05-07T20:31:58.3666707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.3667368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.3667777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3668008Z 2025-05-07T20:31:58.3668230Z self = 2025-05-07T20:31:58.3669319Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.3670756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3006da980>} 2025-05-07T20:31:58.3672116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.3673202Z context = 2025-05-07T20:31:58.3673493Z 2025-05-07T20:31:58.3673676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.3674204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.3674687Z module_map=module_map) 2025-05-07T20:31:58.3675065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.3675440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.3675704Z E ^ 2025-05-07T20:31:58.3676181Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.3676630Z 2025-05-07T20:31:58.3677053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.3677575Z 2025-05-07T20:31:58.4564030Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.4564480Z self=, 2025-05-07T20:31:58.4564930Z T=1, 2025-05-07T20:31:58.4565145Z D=5120, 2025-05-07T20:31:58.4565429Z scale_ub=None, 2025-05-07T20:31:58.4565725Z contiguous=False, 2025-05-07T20:31:58.4566252Z compiled=False, 2025-05-07T20:31:58.4566553Z ) 2025-05-07T20:31:58.4567000Z self = 2025-05-07T20:31:58.4567539Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:58.4567809Z 2025-05-07T20:31:58.4567891Z @given( 2025-05-07T20:31:58.4568134Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.4568463Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.4568777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.4569125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.4569470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.4569771Z ) 2025-05-07T20:31:58.4570127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.4570584Z def test_silu_mul_quant( 2025-05-07T20:31:58.4570837Z self, 2025-05-07T20:31:58.4571041Z T: int, 2025-05-07T20:31:58.4571336Z D: int, 2025-05-07T20:31:58.4571571Z scale_ub: Optional[float], 2025-05-07T20:31:58.4571849Z contiguous: bool, 2025-05-07T20:31:58.4572102Z compiled: bool, 2025-05-07T20:31:58.4572346Z ) -> None: 2025-05-07T20:31:58.4572573Z torch.manual_seed(2025) 2025-05-07T20:31:58.4572829Z 2025-05-07T20:31:58.4573119Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.4573469Z 2025-05-07T20:31:58.4573681Z x_sign = torch.sign(x) 2025-05-07T20:31:58.4573985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.4574302Z x = x_sign * x_clamp 2025-05-07T20:31:58.4574554Z x0 = x[:, :D] 2025-05-07T20:31:58.4574842Z x1 = x[:, D:] 2025-05-07T20:31:58.4575149Z 2025-05-07T20:31:58.4575345Z if contiguous: 2025-05-07T20:31:58.4575593Z x0 = x0.contiguous() 2025-05-07T20:31:58.4575865Z x1 = x1.contiguous() 2025-05-07T20:31:58.4576114Z 2025-05-07T20:31:58.4576406Z if scale_ub is not None: 2025-05-07T20:31:58.4576694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.4577036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.4577355Z ) 2025-05-07T20:31:58.4577557Z else: 2025-05-07T20:31:58.4577772Z scale_ub_tensor = None 2025-05-07T20:31:58.4578032Z 2025-05-07T20:31:58.4578278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.4578594Z op = silu_mul_quant 2025-05-07T20:31:58.4578854Z if compiled: 2025-05-07T20:31:58.4579109Z op = torch.compile(op) 2025-05-07T20:31:58.4579411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4579692Z 2025-05-07T20:31:58.4579897Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.4580063Z 2025-05-07T20:31:58.4580174Z moe/activation_test.py:117: 2025-05-07T20:31:58.4580473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4580818Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.4581106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4581799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.4582501Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.4583103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.4583798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.4584469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.4585015Z kernel = self.compile( 2025-05-07T20:31:58.4585570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.4586318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.4586733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4586974Z 2025-05-07T20:31:58.4587186Z self = 2025-05-07T20:31:58.4588278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.4589648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b4220>} 2025-05-07T20:31:58.4591006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.4592089Z context = 2025-05-07T20:31:58.4592380Z 2025-05-07T20:31:58.4592556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.4593087Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.4593560Z module_map=module_map) 2025-05-07T20:31:58.4593933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.4594300Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.4594561Z E ^ 2025-05-07T20:31:58.4595042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.4595499Z 2025-05-07T20:31:58.4595934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.4596455Z 2025-05-07T20:31:58.4596576Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.4597077Z self=, 2025-05-07T20:31:58.4597482Z T=4096, 2025-05-07T20:31:58.4597676Z D=7168, 2025-05-07T20:31:58.4597881Z scale_ub=1200.0, 2025-05-07T20:31:58.4598107Z contiguous=False, 2025-05-07T20:31:58.4598342Z compiled=False, 2025-05-07T20:31:58.4598553Z ) 2025-05-07T20:31:58.4598878Z self = 2025-05-07T20:31:58.4599392Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.4599676Z 2025-05-07T20:31:58.4599764Z @given( 2025-05-07T20:31:58.4600002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.4600318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.4600632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.4600969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.4601306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.4601604Z ) 2025-05-07T20:31:58.4601963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.4602412Z def test_silu_mul_quant( 2025-05-07T20:31:58.4602667Z self, 2025-05-07T20:31:58.4602888Z T: int, 2025-05-07T20:31:58.4603118Z D: int, 2025-05-07T20:31:58.4603497Z scale_ub: Optional[float], 2025-05-07T20:31:58.4603772Z contiguous: bool, 2025-05-07T20:31:58.4604019Z compiled: bool, 2025-05-07T20:31:58.4604250Z ) -> None: 2025-05-07T20:31:58.4604474Z torch.manual_seed(2025) 2025-05-07T20:31:58.4604721Z 2025-05-07T20:31:58.4605004Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.4605353Z 2025-05-07T20:31:58.4605551Z x_sign = torch.sign(x) 2025-05-07T20:31:58.4605849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.4606279Z x = x_sign * x_clamp 2025-05-07T20:31:58.4606524Z x0 = x[:, :D] 2025-05-07T20:31:58.4606749Z x1 = x[:, D:] 2025-05-07T20:31:58.4606967Z 2025-05-07T20:31:58.4607155Z if contiguous: 2025-05-07T20:31:58.4607392Z x0 = x0.contiguous() 2025-05-07T20:31:58.4607657Z x1 = x1.contiguous() 2025-05-07T20:31:58.4607899Z 2025-05-07T20:31:58.4608095Z if scale_ub is not None: 2025-05-07T20:31:58.4608374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.4608717Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.4609032Z ) 2025-05-07T20:31:58.4609240Z else: 2025-05-07T20:31:58.4609454Z scale_ub_tensor = None 2025-05-07T20:31:58.4609710Z 2025-05-07T20:31:58.4609948Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.4610267Z op = silu_mul_quant 2025-05-07T20:31:58.4610521Z if compiled: 2025-05-07T20:31:58.4610782Z op = torch.compile(op) 2025-05-07T20:31:58.4611136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4611411Z 2025-05-07T20:31:58.4611611Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.4611777Z 2025-05-07T20:31:58.4611884Z moe/activation_test.py:117: 2025-05-07T20:31:58.4612180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4612519Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.4612807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4613511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:58.4614209Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.4614765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.4615461Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.4616143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.4616736Z kernel = self.compile( 2025-05-07T20:31:58.4617295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.4617972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.4618373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4618611Z 2025-05-07T20:31:58.4618825Z self = 2025-05-07T20:31:58.4619911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.4621294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b5440>} 2025-05-07T20:31:58.4622658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.4623697Z context = 2025-05-07T20:31:58.4623991Z 2025-05-07T20:31:58.4624161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.4624695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.4625169Z module_map=module_map) 2025-05-07T20:31:58.4625542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.4625909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.4626178Z E ^ 2025-05-07T20:31:58.4626728Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.4627197Z 2025-05-07T20:31:58.4627625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.4628145Z 2025-05-07T20:31:58.4628261Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.4628676Z self=, 2025-05-07T20:31:58.4629089Z T=16384, 2025-05-07T20:31:58.4629292Z D=7168, 2025-05-07T20:31:58.4629494Z scale_ub=None, 2025-05-07T20:31:58.4629707Z contiguous=True, 2025-05-07T20:31:58.4629939Z compiled=True, 2025-05-07T20:31:58.4630146Z ) 2025-05-07T20:31:58.8012856Z self = 2025-05-07T20:31:58.8013634Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.8014038Z 2025-05-07T20:31:58.8014150Z @given( 2025-05-07T20:31:58.8014615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8015040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8015414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8015752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8016083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8016371Z ) 2025-05-07T20:31:58.8016726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8017173Z def test_silu_mul_quant( 2025-05-07T20:31:58.8017413Z self, 2025-05-07T20:31:58.8017612Z T: int, 2025-05-07T20:31:58.8017816Z D: int, 2025-05-07T20:31:58.8018034Z scale_ub: Optional[float], 2025-05-07T20:31:58.8018311Z contiguous: bool, 2025-05-07T20:31:58.8018554Z compiled: bool, 2025-05-07T20:31:58.8018782Z ) -> None: 2025-05-07T20:31:58.8019006Z torch.manual_seed(2025) 2025-05-07T20:31:58.8019262Z 2025-05-07T20:31:58.8019625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8019975Z 2025-05-07T20:31:58.8020176Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8020468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8020783Z x = x_sign * x_clamp 2025-05-07T20:31:58.8021030Z x0 = x[:, :D] 2025-05-07T20:31:58.8021254Z x1 = x[:, D:] 2025-05-07T20:31:58.8021462Z 2025-05-07T20:31:58.8021659Z if contiguous: 2025-05-07T20:31:58.8021899Z x0 = x0.contiguous() 2025-05-07T20:31:58.8022159Z x1 = x1.contiguous() 2025-05-07T20:31:58.8022403Z 2025-05-07T20:31:58.8022601Z if scale_ub is not None: 2025-05-07T20:31:58.8022878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8023259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8023593Z ) 2025-05-07T20:31:58.8023788Z else: 2025-05-07T20:31:58.8024011Z scale_ub_tensor = None 2025-05-07T20:31:58.8024269Z 2025-05-07T20:31:58.8024503Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8024821Z op = silu_mul_quant 2025-05-07T20:31:58.8025078Z if compiled: 2025-05-07T20:31:58.8025324Z op = torch.compile(op) 2025-05-07T20:31:58.8025625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8025904Z 2025-05-07T20:31:58.8026103Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8026270Z 2025-05-07T20:31:58.8026371Z moe/activation_test.py:117: 2025-05-07T20:31:58.8026669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8027011Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8027290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8027857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.8028546Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.8029215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8029912Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8030622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8031323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8031988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8032527Z kernel = self.compile( 2025-05-07T20:31:58.8033075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8033741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8034145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8034438Z 2025-05-07T20:31:58.8034648Z self = 2025-05-07T20:31:58.8035734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8037108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b6520>} 2025-05-07T20:31:58.8038637Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8039675Z context = 2025-05-07T20:31:58.8039974Z 2025-05-07T20:31:58.8040230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8040761Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8041231Z module_map=module_map) 2025-05-07T20:31:58.8041603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8041969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8042230Z E ^ 2025-05-07T20:31:58.8042704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8043355Z 2025-05-07T20:31:58.8043781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8044303Z 2025-05-07T20:31:58.8044415Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8044829Z self=, 2025-05-07T20:31:58.8045253Z T=4096, 2025-05-07T20:31:58.8045449Z D=5120, 2025-05-07T20:31:58.8045644Z scale_ub=None, 2025-05-07T20:31:58.8045869Z contiguous=False, 2025-05-07T20:31:58.8046101Z compiled=True, 2025-05-07T20:31:58.8046311Z ) 2025-05-07T20:31:58.8046659Z self = 2025-05-07T20:31:58.8047160Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:58.8047439Z 2025-05-07T20:31:58.8047522Z @given( 2025-05-07T20:31:58.8047755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8048069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8048383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8048714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8049039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8049333Z ) 2025-05-07T20:31:58.8049809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8050271Z def test_silu_mul_quant( 2025-05-07T20:31:58.8050514Z self, 2025-05-07T20:31:58.8050717Z T: int, 2025-05-07T20:31:58.8050921Z D: int, 2025-05-07T20:31:58.8051143Z scale_ub: Optional[float], 2025-05-07T20:31:58.8051425Z contiguous: bool, 2025-05-07T20:31:58.8051674Z compiled: bool, 2025-05-07T20:31:58.8051900Z ) -> None: 2025-05-07T20:31:58.8052124Z torch.manual_seed(2025) 2025-05-07T20:31:58.8052372Z 2025-05-07T20:31:58.8052649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8053005Z 2025-05-07T20:31:58.8053212Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8053510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8053829Z x = x_sign * x_clamp 2025-05-07T20:31:58.8054074Z x0 = x[:, :D] 2025-05-07T20:31:58.8054300Z x1 = x[:, D:] 2025-05-07T20:31:58.8054588Z 2025-05-07T20:31:58.8054789Z if contiguous: 2025-05-07T20:31:58.8055025Z x0 = x0.contiguous() 2025-05-07T20:31:58.8055291Z x1 = x1.contiguous() 2025-05-07T20:31:58.8055546Z 2025-05-07T20:31:58.8055748Z if scale_ub is not None: 2025-05-07T20:31:58.8056026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8056370Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8056683Z ) 2025-05-07T20:31:58.8056874Z else: 2025-05-07T20:31:58.8057091Z scale_ub_tensor = None 2025-05-07T20:31:58.8057347Z 2025-05-07T20:31:58.8057577Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8057892Z op = silu_mul_quant 2025-05-07T20:31:58.8058149Z if compiled: 2025-05-07T20:31:58.8058398Z op = torch.compile(op) 2025-05-07T20:31:58.8058702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8058980Z 2025-05-07T20:31:58.8059182Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8059403Z 2025-05-07T20:31:58.8059505Z moe/activation_test.py:117: 2025-05-07T20:31:58.8059804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8060139Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8060424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8060989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.8061559Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.8062219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8063038Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8063679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8064539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8073800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8074372Z kernel = self.compile( 2025-05-07T20:31:58.8074922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8075591Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8075991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8076220Z 2025-05-07T20:31:58.8076431Z self = 2025-05-07T20:31:58.8077516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8079009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b6c00>} 2025-05-07T20:31:58.8080373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8081405Z context = 2025-05-07T20:31:58.8081694Z 2025-05-07T20:31:58.8081869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8082398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8082870Z module_map=module_map) 2025-05-07T20:31:58.8083323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8083728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8084047Z E ^ 2025-05-07T20:31:58.8084529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8084983Z 2025-05-07T20:31:58.8085404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8085925Z 2025-05-07T20:31:58.9237871Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9238656Z self=, 2025-05-07T20:31:58.9239230Z T=4096, 2025-05-07T20:31:58.9239493Z D=5120, 2025-05-07T20:31:58.9239754Z scale_ub=1200.0, 2025-05-07T20:31:58.9240070Z contiguous=False, 2025-05-07T20:31:58.9240379Z compiled=False, 2025-05-07T20:31:58.9240667Z ) 2025-05-07T20:31:58.9241110Z self = 2025-05-07T20:31:58.9241699Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.9242116Z 2025-05-07T20:31:58.9242205Z @given( 2025-05-07T20:31:58.9242441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9242765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9243079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9243577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9243915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9244206Z ) 2025-05-07T20:31:58.9244562Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9245008Z def test_silu_mul_quant( 2025-05-07T20:31:58.9245264Z self, 2025-05-07T20:31:58.9245469Z T: int, 2025-05-07T20:31:58.9245673Z D: int, 2025-05-07T20:31:58.9245899Z scale_ub: Optional[float], 2025-05-07T20:31:58.9246179Z contiguous: bool, 2025-05-07T20:31:58.9246427Z compiled: bool, 2025-05-07T20:31:58.9246661Z ) -> None: 2025-05-07T20:31:58.9246893Z torch.manual_seed(2025) 2025-05-07T20:31:58.9247137Z 2025-05-07T20:31:58.9247416Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9247757Z 2025-05-07T20:31:58.9247954Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9248258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9248571Z x = x_sign * x_clamp 2025-05-07T20:31:58.9248812Z x0 = x[:, :D] 2025-05-07T20:31:58.9249043Z x1 = x[:, D:] 2025-05-07T20:31:58.9249258Z 2025-05-07T20:31:58.9249452Z if contiguous: 2025-05-07T20:31:58.9249681Z x0 = x0.contiguous() 2025-05-07T20:31:58.9249944Z x1 = x1.contiguous() 2025-05-07T20:31:58.9250179Z 2025-05-07T20:31:58.9250371Z if scale_ub is not None: 2025-05-07T20:31:58.9250646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9250988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9251430Z ) 2025-05-07T20:31:58.9251635Z else: 2025-05-07T20:31:58.9251853Z scale_ub_tensor = None 2025-05-07T20:31:58.9252111Z 2025-05-07T20:31:58.9252344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9252664Z op = silu_mul_quant 2025-05-07T20:31:58.9252947Z if compiled: 2025-05-07T20:31:58.9253223Z op = torch.compile(op) 2025-05-07T20:31:58.9253529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9253817Z 2025-05-07T20:31:58.9254014Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.9254184Z 2025-05-07T20:31:58.9254287Z moe/activation_test.py:117: 2025-05-07T20:31:58.9254586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9254924Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.9255207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9255916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:58.9256835Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.9257474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.9258305Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.9259115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.9259759Z kernel = self.compile( 2025-05-07T20:31:58.9260400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.9261197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.9261655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9261923Z 2025-05-07T20:31:58.9262166Z self = 2025-05-07T20:31:58.9263567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.9265304Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b8400>} 2025-05-07T20:31:58.9266995Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.9268270Z context = 2025-05-07T20:31:58.9268607Z 2025-05-07T20:31:58.9268794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.9269414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.9269968Z module_map=module_map) 2025-05-07T20:31:58.9270385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.9270784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.9271073Z E ^ 2025-05-07T20:31:58.9271617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9272174Z 2025-05-07T20:31:58.9272686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.9273330Z 2025-05-07T20:31:58.9273440Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9273917Z self=, 2025-05-07T20:31:58.9274384Z T=4096, 2025-05-07T20:31:58.9274579Z D=5120, 2025-05-07T20:31:58.9274784Z scale_ub=1200.0, 2025-05-07T20:31:58.9275112Z contiguous=False, 2025-05-07T20:31:58.9275338Z compiled=True, 2025-05-07T20:31:58.9275546Z ) 2025-05-07T20:31:58.9275873Z self = 2025-05-07T20:31:58.9276370Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.9276650Z 2025-05-07T20:31:58.9276729Z @given( 2025-05-07T20:31:58.9276961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9277270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9277579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9277906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9278233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9278519Z ) 2025-05-07T20:31:58.9278870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9279314Z def test_silu_mul_quant( 2025-05-07T20:31:58.9279608Z self, 2025-05-07T20:31:58.9279811Z T: int, 2025-05-07T20:31:58.9280012Z D: int, 2025-05-07T20:31:58.9280227Z scale_ub: Optional[float], 2025-05-07T20:31:58.9280501Z contiguous: bool, 2025-05-07T20:31:58.9280743Z compiled: bool, 2025-05-07T20:31:58.9280963Z ) -> None: 2025-05-07T20:31:58.9281185Z torch.manual_seed(2025) 2025-05-07T20:31:58.9281432Z 2025-05-07T20:31:58.9281703Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9282048Z 2025-05-07T20:31:58.9282242Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9282541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9282852Z x = x_sign * x_clamp 2025-05-07T20:31:58.9283090Z x0 = x[:, :D] 2025-05-07T20:31:58.9283424Z x1 = x[:, D:] 2025-05-07T20:31:58.9283630Z 2025-05-07T20:31:58.9283818Z if contiguous: 2025-05-07T20:31:58.9284055Z x0 = x0.contiguous() 2025-05-07T20:31:58.9284377Z x1 = x1.contiguous() 2025-05-07T20:31:58.9284624Z 2025-05-07T20:31:58.9284823Z if scale_ub is not None: 2025-05-07T20:31:58.9285094Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9285434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9285749Z ) 2025-05-07T20:31:58.9285939Z else: 2025-05-07T20:31:58.9286152Z scale_ub_tensor = None 2025-05-07T20:31:58.9286405Z 2025-05-07T20:31:58.9286636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9286952Z op = silu_mul_quant 2025-05-07T20:31:58.9287204Z if compiled: 2025-05-07T20:31:58.9287455Z op = torch.compile(op) 2025-05-07T20:31:58.9287749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9288028Z 2025-05-07T20:31:58.9288227Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.9288391Z 2025-05-07T20:31:58.9288490Z moe/activation_test.py:117: 2025-05-07T20:31:58.9288795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9289130Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.9289413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9289981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.9290547Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.9291215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:58.9291904Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:58.9303884Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:58.9304243Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:58.9304501Z E ^
2025-05-07T20:31:58.9304972Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.9305902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.0185985Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:59.0221473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.0222107Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:59.0253421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.0254052Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:59.4164391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.4165022Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:59.4195475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.4196108Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:59.4909918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
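Every one of these Hypothesis examples fails at the same line and for the same reason: the kernel asks Triton to emit an fp8e4nv (e4m3) value, and that encoding is only available on NVIDIA GPUs with compute capability 8.9 or newer, while the GPU running this job only exposes fp8e5 and fp8e4b15, exactly as the ValueError reports. A minimal sketch of a capability-based fallback, assuming a PyTorch build that has the float8 dtypes; pick_fp8_dtype is a hypothetical helper, not part of fbgemm_gpu:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9);
        # on older GPUs fall back to fp8e5 (torch.float8_e5m2), which the
        # error message above lists as supported on this architecture.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2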
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4909484Z 2025-05-07T20:31:59.4909918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.4910438Z 2025-05-07T20:31:59.4910550Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.4910970Z self=, 2025-05-07T20:31:59.4911386Z T=4096, 2025-05-07T20:31:59.4911576Z D=7168, 2025-05-07T20:31:59.4911777Z scale_ub=None, 2025-05-07T20:31:59.4912007Z contiguous=False, 2025-05-07T20:31:59.4912233Z compiled=True, 2025-05-07T20:31:59.4912451Z ) 2025-05-07T20:31:59.4912780Z self = 2025-05-07T20:31:59.4913275Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.4913560Z 2025-05-07T20:31:59.4913660Z @given( 2025-05-07T20:31:59.4913930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.4914249Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.4914555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.4914890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.4915224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.4915507Z ) 2025-05-07T20:31:59.4915861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.4916312Z def test_silu_mul_quant( 2025-05-07T20:31:59.4916552Z self, 2025-05-07T20:31:59.4916752Z T: int, 2025-05-07T20:31:59.4916954Z D: int, 2025-05-07T20:31:59.4917171Z scale_ub: Optional[float], 2025-05-07T20:31:59.4917451Z contiguous: bool, 2025-05-07T20:31:59.4917696Z compiled: bool, 2025-05-07T20:31:59.4917928Z ) -> None: 2025-05-07T20:31:59.4918144Z torch.manual_seed(2025) 2025-05-07T20:31:59.4918918Z 2025-05-07T20:31:59.4919207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.4919550Z 2025-05-07T20:31:59.4919750Z x_sign = torch.sign(x) 2025-05-07T20:31:59.4920053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.4920366Z x = x_sign * x_clamp 2025-05-07T20:31:59.4920613Z x0 = x[:, :D] 2025-05-07T20:31:59.4920840Z x1 = x[:, D:] 2025-05-07T20:31:59.4921049Z 2025-05-07T20:31:59.4921245Z if contiguous: 2025-05-07T20:31:59.4921488Z x0 = x0.contiguous() 2025-05-07T20:31:59.4921743Z x1 = x1.contiguous() 2025-05-07T20:31:59.4921994Z 2025-05-07T20:31:59.4922196Z if scale_ub is not None: 2025-05-07T20:31:59.4922467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.4922814Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.4923133Z ) 2025-05-07T20:31:59.4923513Z else: 2025-05-07T20:31:59.4923739Z scale_ub_tensor = None 2025-05-07T20:31:59.4924001Z 2025-05-07T20:31:59.4924240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.4924556Z op = silu_mul_quant 2025-05-07T20:31:59.4924819Z if compiled: 2025-05-07T20:31:59.4925070Z op = torch.compile(op) 2025-05-07T20:31:59.4925365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.4925645Z 2025-05-07T20:31:59.4925846Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.4926014Z 2025-05-07T20:31:59.4926116Z moe/activation_test.py:117: 2025-05-07T20:31:59.4926416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.4926751Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.4927033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.4927603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.4928177Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.4928901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.4929598Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.4930147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.4930846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.4931525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.4932063Z kernel = self.compile( 2025-05-07T20:31:59.4932621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.4933292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.4933696Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.4933942Z 2025-05-07T20:31:59.4934155Z self = 2025-05-07T20:31:59.4935249Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.4936632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1fe20>} 2025-05-07T20:31:59.4937989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.4939261Z context = 2025-05-07T20:31:59.4939701Z 2025-05-07T20:31:59.4939878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.4940406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.4940880Z module_map=module_map) 2025-05-07T20:31:59.4941243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4941607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4941872Z E ^ 2025-05-07T20:31:59.4942339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4942800Z 2025-05-07T20:31:59.4943222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.4943748Z 2025-05-07T20:31:59.6205870Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.6206364Z self=, 2025-05-07T20:31:59.6207069Z T=16384, 2025-05-07T20:31:59.6207290Z D=5120, 2025-05-07T20:31:59.6207496Z scale_ub=1200.0, 2025-05-07T20:31:59.6207741Z contiguous=False, 2025-05-07T20:31:59.6208046Z compiled=False, 2025-05-07T20:31:59.6208344Z ) 2025-05-07T20:31:59.6208723Z self = 2025-05-07T20:31:59.6209236Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:59.6209536Z 2025-05-07T20:31:59.6209621Z @given( 2025-05-07T20:31:59.6209871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.6210192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.6210516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.6210861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.6211203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.6211499Z ) 2025-05-07T20:31:59.6211869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.6212440Z def test_silu_mul_quant( 2025-05-07T20:31:59.6212683Z self, 2025-05-07T20:31:59.6212894Z T: int, 2025-05-07T20:31:59.6213109Z D: int, 2025-05-07T20:31:59.6213332Z scale_ub: Optional[float], 2025-05-07T20:31:59.6213666Z contiguous: bool, 2025-05-07T20:31:59.6213918Z compiled: bool, 2025-05-07T20:31:59.6214151Z ) -> None: 2025-05-07T20:31:59.6214380Z torch.manual_seed(2025) 2025-05-07T20:31:59.6214633Z 2025-05-07T20:31:59.6214915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.6215272Z 2025-05-07T20:31:59.6215472Z x_sign = torch.sign(x) 2025-05-07T20:31:59.6215767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.6216075Z x = x_sign * x_clamp 2025-05-07T20:31:59.6216323Z x0 = x[:, :D] 2025-05-07T20:31:59.6216544Z x1 = x[:, D:] 2025-05-07T20:31:59.6216764Z 2025-05-07T20:31:59.6216957Z if contiguous: 2025-05-07T20:31:59.6217195Z x0 = x0.contiguous() 2025-05-07T20:31:59.6217451Z x1 = x1.contiguous() 2025-05-07T20:31:59.6217699Z 2025-05-07T20:31:59.6217893Z if scale_ub is not None: 2025-05-07T20:31:59.6218183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.6218526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.6218846Z ) 2025-05-07T20:31:59.6219042Z else: 2025-05-07T20:31:59.6219259Z scale_ub_tensor = None 2025-05-07T20:31:59.6219521Z 2025-05-07T20:31:59.6219757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.6220084Z op = silu_mul_quant 2025-05-07T20:31:59.6220349Z if compiled: 2025-05-07T20:31:59.6220595Z op = torch.compile(op) 2025-05-07T20:31:59.6220898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6221339Z 2025-05-07T20:31:59.6221545Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.6221712Z 2025-05-07T20:31:59.6221813Z moe/activation_test.py:117: 2025-05-07T20:31:59.6222115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6222456Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.6222740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6223443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:59.6224168Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.6224713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.6235289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.6236014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.6236677Z kernel = self.compile( 2025-05-07T20:31:59.6237247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.6237918Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.6238317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6238866Z 2025-05-07T20:31:59.6239083Z self = 2025-05-07T20:31:59.6240168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.6241552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb38d60>} 2025-05-07T20:31:59.6243002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.6244124Z context = 2025-05-07T20:31:59.6244421Z 2025-05-07T20:31:59.6244594Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.6245121Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.6245588Z module_map=module_map) 2025-05-07T20:31:59.6245964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.6246330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.6246586Z E ^ 2025-05-07T20:31:59.6247061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.6247525Z 2025-05-07T20:31:59.6247951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.6248467Z 2025-05-07T20:31:59.6248578Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.6248988Z self=, 2025-05-07T20:31:59.6249398Z T=16384, 2025-05-07T20:31:59.6249595Z D=5120, 2025-05-07T20:31:59.6249785Z scale_ub=1200.0, 2025-05-07T20:31:59.6250016Z contiguous=True, 2025-05-07T20:31:59.6250239Z compiled=True, 2025-05-07T20:31:59.6250442Z ) 2025-05-07T20:31:59.6250769Z self = 2025-05-07T20:31:59.6251266Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:59.6251543Z 2025-05-07T20:31:59.6251626Z @given( 2025-05-07T20:31:59.6251849Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.6252296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.6252616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.6252944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.6253331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.6253626Z ) 2025-05-07T20:31:59.6253971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.6254420Z def test_silu_mul_quant( 2025-05-07T20:31:59.6254669Z self, 2025-05-07T20:31:59.6254869Z T: int, 2025-05-07T20:31:59.6255066Z D: int, 2025-05-07T20:31:59.6255290Z scale_ub: Optional[float], 2025-05-07T20:31:59.6255567Z contiguous: bool, 2025-05-07T20:31:59.6255806Z compiled: bool, 2025-05-07T20:31:59.6256037Z ) -> None: 2025-05-07T20:31:59.6256259Z torch.manual_seed(2025) 2025-05-07T20:31:59.6256495Z 2025-05-07T20:31:59.6256783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.6257203Z 2025-05-07T20:31:59.6257394Z x_sign = torch.sign(x) 2025-05-07T20:31:59.6257696Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.6258011Z x = x_sign * x_clamp 2025-05-07T20:31:59.6258252Z x0 = x[:, :D] 2025-05-07T20:31:59.6258482Z x1 = x[:, D:] 2025-05-07T20:31:59.6258706Z 2025-05-07T20:31:59.6258897Z if contiguous: 2025-05-07T20:31:59.6259135Z x0 = x0.contiguous() 2025-05-07T20:31:59.6259403Z x1 = x1.contiguous() 2025-05-07T20:31:59.6259638Z 2025-05-07T20:31:59.6259837Z if scale_ub is not None: 2025-05-07T20:31:59.6260115Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.6260444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.6260758Z ) 2025-05-07T20:31:59.6260955Z else: 2025-05-07T20:31:59.6261161Z scale_ub_tensor = None 2025-05-07T20:31:59.6261418Z 2025-05-07T20:31:59.6261658Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.6262014Z op = silu_mul_quant 2025-05-07T20:31:59.6262261Z if compiled: 2025-05-07T20:31:59.6262506Z op = torch.compile(op) 2025-05-07T20:31:59.6262798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6263079Z 2025-05-07T20:31:59.6263273Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.6263435Z 2025-05-07T20:31:59.6263534Z moe/activation_test.py:117: 2025-05-07T20:31:59.6263830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6264196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.6264495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6265053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.6265616Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.6266282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.6266974Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.6267514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.6268196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.6268870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.6269399Z kernel = self.compile( 2025-05-07T20:31:59.6269946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.6270608Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.6271003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6271231Z 2025-05-07T20:31:59.6271531Z self = 2025-05-07T20:31:59.6272617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.6273985Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb3a200>} 2025-05-07T20:31:59.6275335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.6276354Z context = 2025-05-07T20:31:59.6276652Z 2025-05-07T20:31:59.6276819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.6277391Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.6277861Z module_map=module_map) 2025-05-07T20:31:59.6278220Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.6278581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.6278843Z E ^ 2025-05-07T20:31:59.6279306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.6279768Z 2025-05-07T20:31:59.6280187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.6280716Z 2025-05-07T20:31:59.7586541Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.7587016Z self=, 2025-05-07T20:31:59.7587427Z T=16384, 2025-05-07T20:31:59.7587631Z D=5120, 2025-05-07T20:31:59.7587876Z scale_ub=None, 2025-05-07T20:31:59.7588409Z contiguous=False, 2025-05-07T20:31:59.7588636Z compiled=True, 2025-05-07T20:31:59.7588866Z ) 2025-05-07T20:31:59.7589194Z self = 2025-05-07T20:31:59.7589694Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.7589973Z 2025-05-07T20:31:59.7590053Z @given( 2025-05-07T20:31:59.7590291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.7590608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.7590908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.7591241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.7591569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.7591848Z ) 2025-05-07T20:31:59.7592200Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.7592648Z def test_silu_mul_quant( 2025-05-07T20:31:59.7592896Z self, 2025-05-07T20:31:59.7593089Z T: int, 2025-05-07T20:31:59.7593288Z D: int, 2025-05-07T20:31:59.7593511Z scale_ub: Optional[float], 2025-05-07T20:31:59.7593776Z contiguous: bool, 2025-05-07T20:31:59.7594019Z compiled: bool, 2025-05-07T20:31:59.7594250Z ) -> None: 2025-05-07T20:31:59.7594467Z torch.manual_seed(2025) 2025-05-07T20:31:59.7594711Z 2025-05-07T20:31:59.7594987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.7595324Z 2025-05-07T20:31:59.7595521Z x_sign = torch.sign(x) 2025-05-07T20:31:59.7595813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.7596120Z x = x_sign * x_clamp 2025-05-07T20:31:59.7596364Z x0 = x[:, :D] 2025-05-07T20:31:59.7596583Z x1 = x[:, D:] 2025-05-07T20:31:59.7596784Z 2025-05-07T20:31:59.7596976Z if contiguous: 2025-05-07T20:31:59.7597364Z x0 = x0.contiguous() 2025-05-07T20:31:59.7597631Z x1 = x1.contiguous() 2025-05-07T20:31:59.7597875Z 2025-05-07T20:31:59.7598074Z if scale_ub is not None: 2025-05-07T20:31:59.7598353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.7598685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.7598998Z ) 2025-05-07T20:31:59.7599196Z else: 2025-05-07T20:31:59.7599400Z scale_ub_tensor = None 2025-05-07T20:31:59.7599654Z 2025-05-07T20:31:59.7599890Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.7600202Z op = silu_mul_quant 2025-05-07T20:31:59.7600455Z if compiled: 2025-05-07T20:31:59.7600709Z op = torch.compile(op) 2025-05-07T20:31:59.7601003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.7601279Z 2025-05-07T20:31:59.7601476Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.7601638Z 2025-05-07T20:31:59.7601834Z moe/activation_test.py:117: 2025-05-07T20:31:59.7602134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.7602473Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.7602765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.7603424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.7603994Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.7604658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.7605340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.7605879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.7606575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.7607248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.7607828Z kernel = self.compile( 2025-05-07T20:31:59.7608372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.7609055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.7609448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.7609686Z 2025-05-07T20:31:59.7609899Z self = 2025-05-07T20:31:59.7610985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.7612376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb3ad40>} 2025-05-07T20:31:59.7613751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.7614809Z context = 2025-05-07T20:31:59.7615102Z 2025-05-07T20:31:59.7615269Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.7615795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.7616258Z module_map=module_map) 2025-05-07T20:31:59.7616626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.7616984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.7617246Z E ^ 2025-05-07T20:31:59.7617820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.7618284Z 2025-05-07T20:31:59.7618704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.7619220Z 2025-05-07T20:31:59.7619332Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.7619746Z self=, 2025-05-07T20:31:59.7620155Z T=2048, 2025-05-07T20:31:59.7620347Z D=5120, 2025-05-07T20:31:59.7620543Z scale_ub=None, 2025-05-07T20:31:59.7620753Z contiguous=False, 2025-05-07T20:31:59.7620977Z compiled=True, 2025-05-07T20:31:59.7621179Z ) 2025-05-07T20:32:00.0310814Z self = 2025-05-07T20:32:00.0312274Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:00.0312824Z 2025-05-07T20:32:00.0312987Z @given( 2025-05-07T20:32:00.0313296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0313758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0314070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0314396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0314730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0315018Z ) 2025-05-07T20:32:00.0315376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0315817Z def test_silu_mul_quant( 2025-05-07T20:32:00.0316063Z self, 2025-05-07T20:32:00.0316267Z T: int, 2025-05-07T20:32:00.0316464Z D: int, 2025-05-07T20:32:00.0316688Z scale_ub: Optional[float], 2025-05-07T20:32:00.0316965Z contiguous: bool, 2025-05-07T20:32:00.0317204Z compiled: bool, 2025-05-07T20:32:00.0317433Z ) -> None: 2025-05-07T20:32:00.0317654Z torch.manual_seed(2025) 2025-05-07T20:32:00.0317896Z 2025-05-07T20:32:00.0318180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0318612Z 2025-05-07T20:32:00.0318806Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0319106Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0319426Z x = x_sign * x_clamp 2025-05-07T20:32:00.0319662Z x0 = x[:, :D] 2025-05-07T20:32:00.0319885Z x1 = x[:, D:] 2025-05-07T20:32:00.0320099Z 2025-05-07T20:32:00.0320284Z if contiguous: 2025-05-07T20:32:00.0320521Z x0 = x0.contiguous() 2025-05-07T20:32:00.0320786Z x1 = x1.contiguous() 2025-05-07T20:32:00.0321031Z 2025-05-07T20:32:00.0321224Z if scale_ub is not None: 2025-05-07T20:32:00.0321501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.0321840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.0322145Z ) 2025-05-07T20:32:00.0322345Z else: 2025-05-07T20:32:00.0322562Z scale_ub_tensor = None 2025-05-07T20:32:00.0322821Z 2025-05-07T20:32:00.0323063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0323555Z op = silu_mul_quant 2025-05-07T20:32:00.0323803Z if compiled: 2025-05-07T20:32:00.0324055Z op = torch.compile(op) 2025-05-07T20:32:00.0324355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0324627Z 2025-05-07T20:32:00.0324822Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.0324993Z 2025-05-07T20:32:00.0325099Z moe/activation_test.py:117: 2025-05-07T20:32:00.0325403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0325733Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.0326019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0326593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:00.0327155Z return fn(*args, **kwargs) 
2025-05-07T20:32:00.0327964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.0328668Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.0329213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0329894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0330562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0331101Z kernel = self.compile( 2025-05-07T20:32:00.0331644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0332308Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0332708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0332986Z 2025-05-07T20:32:00.0333204Z self = 2025-05-07T20:32:00.0334280Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0335661Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfc6c7c0>} 2025-05-07T20:32:00.0337016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0338051Z context = 2025-05-07T20:32:00.0338338Z 2025-05-07T20:32:00.0338688Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0339290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0339765Z module_map=module_map) 2025-05-07T20:32:00.0340139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0340494Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.0340762Z E ^ 2025-05-07T20:32:00.0341236Z E ValueError("type fp8e4nv not supported in this architecture. 
The next nine Hypothesis examples fail identically -- compilation of _fbgemm_silu_mul_quant raises the same CompilationError (ValueError: type fp8e4nv not supported in this architecture) from triton/compiler/compiler.py:100, with and without torch.compile; only the drawn parameters differ:
2025-05-07T20:32:00.0342739Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.1702808Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError
2025-05-07T20:32:00.1748247Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.2606069Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError
2025-05-07T20:32:00.3601838Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError
2025-05-07T20:32:00.3633882Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
2025-05-07T20:32:00.7033673Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.7067260Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.7830074Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError
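The remaining examples, shown next, stop before the kernel is even reached: they hit CUDA OOM while building the test input. The failed sizes match the test's own allocations: x is [T, 2 * D] bfloat16, so T=16384, D=7168 gives 16384 * 14336 * 2 B = 448 MiB, exactly the failed torch.randn allocation below, and torch.abs/torch.clamp each materialize one more tensor of the same size. A sketch of the two mitigations the error text itself suggests (where these would live in the test harness is an assumption):

import gc
import os

import torch

# From the error message: let the caching allocator grow segments instead of
# fragmenting. Must be set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def release_cuda_memory() -> None:
    # Hypothetical helper (not in the test file): drop dead references and
    # return cached, unused blocks to the driver between Hypothesis examples.
    gc.collect()
    torch.cuda.empty_cache()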
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7861457Z 2025-05-07T20:32:00.7861885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.7862400Z 2025-05-07T20:32:00.8504476Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8505144Z self=, 2025-05-07T20:32:00.8505555Z T=16384, 2025-05-07T20:32:00.8505753Z D=5120, 2025-05-07T20:32:00.8505950Z scale_ub=None, 2025-05-07T20:32:00.8506163Z contiguous=False, 2025-05-07T20:32:00.8506393Z compiled=False, 2025-05-07T20:32:00.8506596Z ) 2025-05-07T20:32:00.8506919Z self = 2025-05-07T20:32:00.8507413Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.8507695Z 2025-05-07T20:32:00.8507776Z @given( 2025-05-07T20:32:00.8508004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8508567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8508873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8509201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8509524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8509813Z ) 2025-05-07T20:32:00.8510163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8510603Z def test_silu_mul_quant( 2025-05-07T20:32:00.8510840Z self, 2025-05-07T20:32:00.8511039Z T: int, 2025-05-07T20:32:00.8511242Z D: int, 2025-05-07T20:32:00.8511452Z scale_ub: Optional[float], 2025-05-07T20:32:00.8511728Z contiguous: bool, 2025-05-07T20:32:00.8511969Z compiled: bool, 2025-05-07T20:32:00.8512190Z ) -> None: 2025-05-07T20:32:00.8512410Z torch.manual_seed(2025) 2025-05-07T20:32:00.8512657Z 2025-05-07T20:32:00.8512933Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8513289Z 2025-05-07T20:32:00.8513486Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8513774Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8515799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8517687Z 2025-05-07T20:32:00.8517808Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.8518025Z 2025-05-07T20:32:00.8518132Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8518714Z self=, 2025-05-07T20:32:00.8519123Z T=4096, 2025-05-07T20:32:00.8519317Z D=7168, 2025-05-07T20:32:00.8519513Z scale_ub=1200.0, 2025-05-07T20:32:00.8519734Z contiguous=True, 2025-05-07T20:32:00.8519963Z compiled=True, 2025-05-07T20:32:00.8520174Z ) 2025-05-07T20:32:00.8520502Z self = 2025-05-07T20:32:00.8520995Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.8521277Z 2025-05-07T20:32:00.8521359Z @given( 2025-05-07T20:32:00.8521593Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8521899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8522209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8522545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8522881Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8523379Z ) 2025-05-07T20:32:00.8523737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8524182Z def test_silu_mul_quant( 2025-05-07T20:32:00.8524416Z self, 2025-05-07T20:32:00.8524623Z T: int, 2025-05-07T20:32:00.8524836Z D: int, 2025-05-07T20:32:00.8525055Z scale_ub: Optional[float], 2025-05-07T20:32:00.8525344Z contiguous: bool, 2025-05-07T20:32:00.8525593Z compiled: bool, 2025-05-07T20:32:00.8525821Z ) -> None: 2025-05-07T20:32:00.8526051Z torch.manual_seed(2025) 2025-05-07T20:32:00.8526305Z 2025-05-07T20:32:00.8526579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8526930Z 2025-05-07T20:32:00.8527137Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8527434Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8529455Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8531389Z 2025-05-07T20:32:00.8531513Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.8531741Z 2025-05-07T20:32:00.8531849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8532278Z self=, 2025-05-07T20:32:00.8532687Z T=16384, 2025-05-07T20:32:00.8532897Z D=7168, 2025-05-07T20:32:00.8533104Z scale_ub=None, 2025-05-07T20:32:00.8533318Z contiguous=False, 2025-05-07T20:32:00.8533594Z compiled=False, 2025-05-07T20:32:00.8533824Z ) 2025-05-07T20:32:00.8534138Z self = 2025-05-07T20:32:00.8534643Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.8534929Z 2025-05-07T20:32:00.8535008Z @given( 2025-05-07T20:32:00.8535252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8535563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8535872Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8536202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8536527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8536820Z ) 2025-05-07T20:32:00.8537172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8537618Z def test_silu_mul_quant( 2025-05-07T20:32:00.8537854Z self, 2025-05-07T20:32:00.8538178Z T: int, 2025-05-07T20:32:00.8538653Z D: int, 2025-05-07T20:32:00.8538873Z scale_ub: Optional[float], 2025-05-07T20:32:00.8539149Z contiguous: bool, 2025-05-07T20:32:00.8539394Z compiled: bool, 2025-05-07T20:32:00.8539617Z ) -> None: 2025-05-07T20:32:00.8539839Z torch.manual_seed(2025) 2025-05-07T20:32:00.8540084Z 2025-05-07T20:32:00.8540354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8542418Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8544368Z 2025-05-07T20:32:00.8544486Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.8544703Z 2025-05-07T20:32:00.8544807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8545292Z self=, 2025-05-07T20:32:00.8556246Z T=2048, 2025-05-07T20:32:00.8556452Z D=7168, 2025-05-07T20:32:00.8556642Z scale_ub=1200.0, 2025-05-07T20:32:00.8556879Z contiguous=True, 2025-05-07T20:32:00.8557108Z compiled=True, 2025-05-07T20:32:00.8557310Z ) 2025-05-07T20:32:00.8557640Z self = 2025-05-07T20:32:00.8558146Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.8558418Z 2025-05-07T20:32:00.8558506Z @given( 2025-05-07T20:32:00.8558737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8559069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8559504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8559829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8560158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8560445Z ) 2025-05-07T20:32:00.8560785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8561217Z def test_silu_mul_quant( 2025-05-07T20:32:00.8561464Z self, 2025-05-07T20:32:00.8561658Z T: int, 2025-05-07T20:32:00.8561862Z D: int, 2025-05-07T20:32:00.8562087Z scale_ub: Optional[float], 2025-05-07T20:32:00.8562357Z contiguous: bool, 2025-05-07T20:32:00.8562602Z compiled: bool, 2025-05-07T20:32:00.8562835Z ) -> None: 2025-05-07T20:32:00.8563058Z torch.manual_seed(2025) 2025-05-07T20:32:00.8563435Z 2025-05-07T20:32:00.8563717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8564067Z 2025-05-07T20:32:00.8564253Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8564552Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8566550Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
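Note that in each message nearly all of the ~22 GiB is "allocated by PyTorch" (around 21.7 GiB) rather than merely reserved, which points at live tensors surviving across hypothesis examples; traceback frames captured for earlier failing examples are one common way such references stay alive. A minimal mitigation sketch, assuming the growth really is cross-example garbage rather than a leak inside the op itself:

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical placement in this TestCase
        def tearDown(self) -> None:
            gc.collect()              # drop example-local tensors still referenced by frames
            torch.cuda.empty_cache()  # return cached blocks so the next example can allocate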
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8568405Z 2025-05-07T20:32:00.8568526Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.8568739Z 2025-05-07T20:32:00.8568849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8569383Z self=, 2025-05-07T20:32:00.8569798Z T=2048, 2025-05-07T20:32:00.8569995Z D=7168, 2025-05-07T20:32:00.8570183Z scale_ub=None, 2025-05-07T20:32:00.8570400Z contiguous=True, 2025-05-07T20:32:00.8570629Z compiled=False, 2025-05-07T20:32:00.8570831Z ) 2025-05-07T20:32:00.9431780Z self = 2025-05-07T20:32:00.9433104Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.9433662Z 2025-05-07T20:32:00.9433819Z @given( 2025-05-07T20:32:00.9434157Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.9434467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.9434776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.9435104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.9435429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.9435990Z ) 2025-05-07T20:32:00.9436354Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.9436794Z def test_silu_mul_quant( 2025-05-07T20:32:00.9437037Z self, 2025-05-07T20:32:00.9437234Z T: int, 2025-05-07T20:32:00.9437431Z D: int, 2025-05-07T20:32:00.9437659Z scale_ub: Optional[float], 2025-05-07T20:32:00.9437932Z contiguous: bool, 2025-05-07T20:32:00.9438166Z compiled: bool, 2025-05-07T20:32:00.9438662Z ) -> None: 2025-05-07T20:32:00.9438889Z torch.manual_seed(2025) 2025-05-07T20:32:00.9439129Z 2025-05-07T20:32:00.9439410Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.9439751Z 2025-05-07T20:32:00.9439948Z > x_sign = torch.sign(x) 2025-05-07T20:32:00.9441891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
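The error text suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but that setting targets fragmentation, i.e. large "reserved by PyTorch but unallocated" numbers; here that figure peaks around 141 MiB, so it is unlikely to rescue these runs. If tried, it must be set before the first CUDA allocation, e.g. in the job environment rather than inside the test:

    # Shell, before launching pytest (illustrative):
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")  # before CUDA init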
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.9443939Z 2025-05-07T20:32:00.9444057Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:00.9444273Z 2025-05-07T20:32:00.9444375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.9444786Z self=, 2025-05-07T20:32:00.9445186Z T=1, 2025-05-07T20:32:00.9445369Z D=7168, 2025-05-07T20:32:00.9445562Z scale_ub=1200.0, 2025-05-07T20:32:00.9445779Z contiguous=True, 2025-05-07T20:32:00.9446005Z compiled=False, 2025-05-07T20:32:00.9446217Z ) 2025-05-07T20:32:00.9446538Z self = 2025-05-07T20:32:00.9447024Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.9447296Z 2025-05-07T20:32:00.9447375Z @given( 2025-05-07T20:32:00.9447605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.9447907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.9448217Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.9448546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.9448867Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.9449151Z ) 2025-05-07T20:32:00.9449501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.9449941Z def test_silu_mul_quant( 2025-05-07T20:32:00.9450175Z self, 2025-05-07T20:32:00.9450374Z T: int, 2025-05-07T20:32:00.9450571Z D: int, 2025-05-07T20:32:00.9450785Z scale_ub: Optional[float], 2025-05-07T20:32:00.9451228Z contiguous: bool, 2025-05-07T20:32:00.9451476Z compiled: bool, 2025-05-07T20:32:00.9451694Z ) -> None: 2025-05-07T20:32:00.9451912Z torch.manual_seed(2025) 2025-05-07T20:32:00.9452154Z 2025-05-07T20:32:00.9452425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.9452772Z 2025-05-07T20:32:00.9452974Z x_sign = torch.sign(x) 2025-05-07T20:32:00.9453261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.9453575Z x = x_sign * x_clamp 2025-05-07T20:32:00.9453818Z x0 = x[:, :D] 2025-05-07T20:32:00.9454028Z x1 = x[:, D:] 2025-05-07T20:32:00.9454243Z 2025-05-07T20:32:00.9454432Z if contiguous: 2025-05-07T20:32:00.9454665Z x0 = x0.contiguous() 2025-05-07T20:32:00.9454928Z x1 = x1.contiguous() 2025-05-07T20:32:00.9455171Z 2025-05-07T20:32:00.9455368Z if scale_ub is not None: 2025-05-07T20:32:00.9455717Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.9456055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.9456365Z ) 2025-05-07T20:32:00.9456556Z else: 2025-05-07T20:32:00.9456768Z scale_ub_tensor = None 2025-05-07T20:32:00.9457023Z 2025-05-07T20:32:00.9457253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.9457565Z op = silu_mul_quant 2025-05-07T20:32:00.9457812Z if compiled: 2025-05-07T20:32:00.9458056Z op = torch.compile(op) 2025-05-07T20:32:00.9458354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.9458635Z 2025-05-07T20:32:00.9458826Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.9458995Z 2025-05-07T20:32:00.9459095Z moe/activation_test.py:117: 2025-05-07T20:32:00.9459394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.9459728Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.9460014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.9460756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.9461454Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.9461990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.9462678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.9463351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.9463888Z kernel = self.compile( 2025-05-07T20:32:00.9464430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.9465095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.9465542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.9465870Z 2025-05-07T20:32:00.9466173Z self = 2025-05-07T20:32:00.9467363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.9468744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf661440>} 2025-05-07T20:32:00.9470096Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.9471128Z context = 2025-05-07T20:32:00.9471525Z 2025-05-07T20:32:00.9471695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.9472216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.9472685Z module_map=module_map) 2025-05-07T20:32:00.9473050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.9473402Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.9473664Z E ^ 2025-05-07T20:32:00.9474131Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.9474636Z 2025-05-07T20:32:00.9475372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.9475898Z 2025-05-07T20:32:00.9476002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.9476925Z self=, 2025-05-07T20:32:00.9477618Z T=128, 2025-05-07T20:32:00.9477849Z D=5120, 2025-05-07T20:32:00.9478063Z scale_ub=None, 2025-05-07T20:32:00.9478278Z contiguous=True, 2025-05-07T20:32:00.9478505Z compiled=False, 2025-05-07T20:32:00.9478710Z ) 2025-05-07T20:32:01.0026597Z self = 2025-05-07T20:32:01.0027361Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.0027635Z 2025-05-07T20:32:01.0027723Z @given( 2025-05-07T20:32:01.0027950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0028265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0028575Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0028899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0029231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0029520Z ) 2025-05-07T20:32:01.0029910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0030612Z def test_silu_mul_quant( 2025-05-07T20:32:01.0030857Z self, 2025-05-07T20:32:01.0031061Z T: int, 2025-05-07T20:32:01.0031261Z D: int, 2025-05-07T20:32:01.0031483Z scale_ub: Optional[float], 2025-05-07T20:32:01.0031760Z contiguous: bool, 2025-05-07T20:32:01.0031996Z compiled: bool, 2025-05-07T20:32:01.0032224Z ) -> None: 2025-05-07T20:32:01.0032440Z torch.manual_seed(2025) 2025-05-07T20:32:01.0032677Z 2025-05-07T20:32:01.0032948Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0033295Z 2025-05-07T20:32:01.0033481Z x_sign = torch.sign(x) 2025-05-07T20:32:01.0033775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.0034109Z x = x_sign * x_clamp 2025-05-07T20:32:01.0034368Z x0 = x[:, :D] 2025-05-07T20:32:01.0034588Z x1 = x[:, D:] 2025-05-07T20:32:01.0034810Z 2025-05-07T20:32:01.0034998Z if contiguous: 2025-05-07T20:32:01.0035228Z x0 = x0.contiguous() 2025-05-07T20:32:01.0035490Z x1 = x1.contiguous() 2025-05-07T20:32:01.0035731Z 2025-05-07T20:32:01.0035921Z if scale_ub is not None: 2025-05-07T20:32:01.0036200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.0036538Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.0036840Z ) 2025-05-07T20:32:01.0037035Z else: 2025-05-07T20:32:01.0037246Z scale_ub_tensor = None 2025-05-07T20:32:01.0037492Z 2025-05-07T20:32:01.0037729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.0038046Z op = silu_mul_quant 2025-05-07T20:32:01.0038293Z if compiled: 2025-05-07T20:32:01.0038824Z op = torch.compile(op) 2025-05-07T20:32:01.0039125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0039586Z 2025-05-07T20:32:01.0039788Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.0039959Z 2025-05-07T20:32:01.0040060Z moe/activation_test.py:117: 2025-05-07T20:32:01.0040358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0040685Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.0040967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0041662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.0042347Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.0042891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.0043700Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.0044372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.0044997Z kernel = self.compile( 2025-05-07T20:32:01.0045545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.0046206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0046601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0046837Z 2025-05-07T20:32:01.0047048Z self = 2025-05-07T20:32:01.0048124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.0049499Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf662660>} 2025-05-07T20:32:01.0050914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.0051933Z context = 2025-05-07T20:32:01.0052225Z 2025-05-07T20:32:01.0052393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.0052918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0053384Z module_map=module_map) 2025-05-07T20:32:01.0053744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0054097Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0054362Z E ^ 2025-05-07T20:32:01.0054821Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.0055285Z 2025-05-07T20:32:01.0055707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0056228Z 2025-05-07T20:32:01.0056331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0056742Z self=, 2025-05-07T20:32:01.0057137Z T=128, 2025-05-07T20:32:01.0057328Z D=7168, 2025-05-07T20:32:01.0057525Z scale_ub=None, 2025-05-07T20:32:01.0057732Z contiguous=True, 2025-05-07T20:32:01.0057956Z compiled=False, 2025-05-07T20:32:01.0058164Z ) 2025-05-07T20:32:01.0058478Z self = 2025-05-07T20:32:01.0058967Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.0059241Z 2025-05-07T20:32:01.0059320Z @given( 2025-05-07T20:32:01.0059547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0059943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0060256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0060584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0060905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0061193Z ) 2025-05-07T20:32:01.0061542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0061976Z def test_silu_mul_quant( 2025-05-07T20:32:01.0062221Z self, 2025-05-07T20:32:01.0062417Z T: int, 2025-05-07T20:32:01.0062609Z D: int, 2025-05-07T20:32:01.0062828Z scale_ub: Optional[float], 2025-05-07T20:32:01.0063103Z contiguous: bool, 2025-05-07T20:32:01.0063335Z compiled: bool, 2025-05-07T20:32:01.0063572Z ) -> None: 2025-05-07T20:32:01.0063828Z torch.manual_seed(2025) 2025-05-07T20:32:01.0064078Z 2025-05-07T20:32:01.0064345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0064799Z 2025-05-07T20:32:01.0064996Z x_sign = torch.sign(x) 2025-05-07T20:32:01.0065283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.0065598Z x = x_sign * x_clamp 2025-05-07T20:32:01.0065844Z x0 = x[:, :D] 2025-05-07T20:32:01.0066055Z x1 = x[:, D:] 2025-05-07T20:32:01.0066273Z 2025-05-07T20:32:01.0066463Z if contiguous: 2025-05-07T20:32:01.0066695Z x0 = x0.contiguous() 2025-05-07T20:32:01.0066957Z x1 = x1.contiguous() 2025-05-07T20:32:01.0067200Z 2025-05-07T20:32:01.0067388Z if scale_ub is not None: 2025-05-07T20:32:01.0067661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.0067999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.0068304Z ) 2025-05-07T20:32:01.0068498Z else: 2025-05-07T20:32:01.0068709Z scale_ub_tensor = None 2025-05-07T20:32:01.0068964Z 2025-05-07T20:32:01.0069199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.0069565Z op = silu_mul_quant 2025-05-07T20:32:01.0069817Z if compiled: 2025-05-07T20:32:01.0070061Z op = torch.compile(op) 2025-05-07T20:32:01.0070362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0070637Z 2025-05-07T20:32:01.0070825Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.0070997Z 2025-05-07T20:32:01.0071096Z moe/activation_test.py:117: 2025-05-07T20:32:01.0071393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0071720Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.0072003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0072703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.0073401Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.0073947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.0074640Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.0075320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.0075856Z kernel = self.compile( 2025-05-07T20:32:01.0076402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.0077072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0077472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0077701Z 2025-05-07T20:32:01.0077910Z self = 2025-05-07T20:32:01.0079076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.0080457Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf6636a0>} 2025-05-07T20:32:01.0081812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.0082846Z context = 2025-05-07T20:32:01.0083136Z 2025-05-07T20:32:01.0083421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.0083953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0084422Z module_map=module_map) 2025-05-07T20:32:01.0084835Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0085202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0085468Z E ^ 2025-05-07T20:32:01.0085942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.0086400Z 2025-05-07T20:32:01.0086822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0087343Z 2025-05-07T20:32:01.0087448Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0087865Z self=, 2025-05-07T20:32:01.0088273Z T=2048, 2025-05-07T20:32:01.0088462Z D=7168, 2025-05-07T20:32:01.0088656Z scale_ub=1200.0, 2025-05-07T20:32:01.0088881Z contiguous=True, 2025-05-07T20:32:01.0089100Z compiled=False, 2025-05-07T20:32:01.0089310Z ) 2025-05-07T20:32:01.0760735Z self = 2025-05-07T20:32:01.0762472Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.0763026Z 2025-05-07T20:32:01.0763188Z @given( 2025-05-07T20:32:01.0763692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0764049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0764355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0764677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0765009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0765297Z ) 2025-05-07T20:32:01.0765642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0766089Z def test_silu_mul_quant( 2025-05-07T20:32:01.0766332Z self, 2025-05-07T20:32:01.0766531Z T: int, 2025-05-07T20:32:01.0766728Z D: int, 2025-05-07T20:32:01.0766953Z scale_ub: Optional[float], 2025-05-07T20:32:01.0767234Z contiguous: bool, 2025-05-07T20:32:01.0767467Z compiled: bool, 2025-05-07T20:32:01.0767693Z ) -> None: 2025-05-07T20:32:01.0767913Z torch.manual_seed(2025) 2025-05-07T20:32:01.0768147Z 2025-05-07T20:32:01.0768424Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0770485Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
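The CompilationError traces interleaved above are a separate failure mode from the OOMs: fp8e4nv is Triton's FP8 E4M3 type, which Triton supports only on GPUs of compute capability 8.9 or newer, while older architectures expose just fp8e4b15 and fp8e5, exactly as the ValueError lists. A guard sketch (the helper name is illustrative, not part of the test):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) needs sm_89+ (Ada/Hopper and newer)
        return torch.cuda.get_device_capability() >= (8, 9)

    # Inside test_silu_mul_quant, before invoking silu_mul_quant:
    # if not supports_fp8e4nv():
    #     raise unittest.SkipTest("Triton fp8e4nv requires compute capability >= 8.9")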
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.0772349Z 2025-05-07T20:32:01.0772646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.0772860Z 2025-05-07T20:32:01.0772969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0773375Z self=, 2025-05-07T20:32:01.0773842Z T=1, 2025-05-07T20:32:01.0774036Z D=5120, 2025-05-07T20:32:01.0774227Z scale_ub=1200.0, 2025-05-07T20:32:01.0774457Z contiguous=True, 2025-05-07T20:32:01.0774682Z compiled=False, 2025-05-07T20:32:01.0774884Z ) 2025-05-07T20:32:01.0775206Z self = 2025-05-07T20:32:01.0775699Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.0775966Z 2025-05-07T20:32:01.0776047Z @given( 2025-05-07T20:32:01.0776278Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0776593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0776902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0777323Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0777660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0777952Z ) 2025-05-07T20:32:01.0778293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0778737Z def test_silu_mul_quant( 2025-05-07T20:32:01.0778981Z self, 2025-05-07T20:32:01.0779168Z T: int, 2025-05-07T20:32:01.0779366Z D: int, 2025-05-07T20:32:01.0779585Z scale_ub: Optional[float], 2025-05-07T20:32:01.0790100Z contiguous: bool, 2025-05-07T20:32:01.0790387Z compiled: bool, 2025-05-07T20:32:01.0790623Z ) -> None: 2025-05-07T20:32:01.0790848Z torch.manual_seed(2025) 2025-05-07T20:32:01.0791089Z 2025-05-07T20:32:01.0791379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0791733Z 2025-05-07T20:32:01.0791926Z x_sign = torch.sign(x) 2025-05-07T20:32:01.0792238Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.0792639Z x = x_sign * x_clamp 2025-05-07T20:32:01.0792886Z x0 = x[:, :D] 2025-05-07T20:32:01.0793104Z x1 = x[:, D:] 2025-05-07T20:32:01.0793316Z 2025-05-07T20:32:01.0793522Z if contiguous: 2025-05-07T20:32:01.0793793Z x0 = x0.contiguous() 2025-05-07T20:32:01.0794058Z x1 = x1.contiguous() 2025-05-07T20:32:01.0794308Z 2025-05-07T20:32:01.0794495Z if scale_ub is not None: 2025-05-07T20:32:01.0794780Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.0795125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.0795435Z ) 2025-05-07T20:32:01.0795642Z else: 2025-05-07T20:32:01.0795860Z scale_ub_tensor = None 2025-05-07T20:32:01.0796117Z 2025-05-07T20:32:01.0796356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.0796678Z op = silu_mul_quant 2025-05-07T20:32:01.0796934Z if compiled: 2025-05-07T20:32:01.0797187Z op = torch.compile(op) 2025-05-07T20:32:01.0797489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0797760Z 2025-05-07T20:32:01.0797959Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.0798132Z 2025-05-07T20:32:01.0798239Z moe/activation_test.py:117: 2025-05-07T20:32:01.0798542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0798874Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.0799160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0799860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.0800547Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.0801094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.0801870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.0802553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.0803088Z kernel = self.compile( 2025-05-07T20:32:01.0803732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.0804392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0804780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0805018Z 2025-05-07T20:32:01.0805229Z self = 2025-05-07T20:32:01.0806316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.0807749Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf454b80>} 2025-05-07T20:32:01.0809099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.0810119Z context = 2025-05-07T20:32:01.0810420Z 2025-05-07T20:32:01.0810589Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.0811118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0811592Z module_map=module_map) 2025-05-07T20:32:01.0811952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0812322Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0812634Z E ^ 2025-05-07T20:32:01.0813097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.0813556Z 2025-05-07T20:32:01.0813976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0814488Z 2025-05-07T20:32:01.0814599Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0815003Z self=, 2025-05-07T20:32:01.0815406Z T=2048, 2025-05-07T20:32:01.0815601Z D=5120, 2025-05-07T20:32:01.0815788Z scale_ub=None, 2025-05-07T20:32:01.0816007Z contiguous=True, 2025-05-07T20:32:01.0816235Z compiled=False, 2025-05-07T20:32:01.0816432Z ) 2025-05-07T20:32:01.0816751Z self = 2025-05-07T20:32:01.0817253Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.0817526Z 2025-05-07T20:32:01.0817599Z @given( 2025-05-07T20:32:01.0817830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0818142Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0818446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0818764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0819091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0819377Z ) 2025-05-07T20:32:01.0819721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0820163Z def test_silu_mul_quant( 2025-05-07T20:32:01.0820407Z self, 2025-05-07T20:32:01.0820597Z T: int, 2025-05-07T20:32:01.0820793Z D: int, 2025-05-07T20:32:01.0821011Z scale_ub: Optional[float], 2025-05-07T20:32:01.0821275Z contiguous: bool, 2025-05-07T20:32:01.0821512Z compiled: bool, 2025-05-07T20:32:01.0821820Z ) -> None: 2025-05-07T20:32:01.0822032Z torch.manual_seed(2025) 2025-05-07T20:32:01.0822273Z 2025-05-07T20:32:01.0822543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0822888Z 2025-05-07T20:32:01.0823073Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.0825018Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.0826869Z 2025-05-07T20:32:01.0826987Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.0827251Z 2025-05-07T20:32:01.0827361Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0827768Z self=, 2025-05-07T20:32:01.0828170Z T=16384, 2025-05-07T20:32:01.0828367Z D=5120, 2025-05-07T20:32:01.0828559Z scale_ub=None, 2025-05-07T20:32:01.0828766Z contiguous=True, 2025-05-07T20:32:01.0828992Z compiled=False, 2025-05-07T20:32:01.0829196Z ) 2025-05-07T20:32:01.1531017Z self = 2025-05-07T20:32:01.1532578Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.1533283Z 2025-05-07T20:32:01.1533457Z @given( 2025-05-07T20:32:01.1533814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1534124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1534437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1534798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1535369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1535664Z ) 2025-05-07T20:32:01.1536022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1536479Z def test_silu_mul_quant( 2025-05-07T20:32:01.1536720Z self, 2025-05-07T20:32:01.1536922Z T: int, 2025-05-07T20:32:01.1537124Z D: int, 2025-05-07T20:32:01.1537343Z scale_ub: Optional[float], 2025-05-07T20:32:01.1537620Z contiguous: bool, 2025-05-07T20:32:01.1537861Z compiled: bool, 2025-05-07T20:32:01.1538091Z ) -> None: 2025-05-07T20:32:01.1538309Z torch.manual_seed(2025) 2025-05-07T20:32:01.1538866Z 2025-05-07T20:32:01.1539135Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1541186Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1543066Z 2025-05-07T20:32:01.1543184Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1543405Z 2025-05-07T20:32:01.1543509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1543975Z self=, 2025-05-07T20:32:01.1544367Z T=4096, 2025-05-07T20:32:01.1544559Z D=5120, 2025-05-07T20:32:01.1544758Z scale_ub=None, 2025-05-07T20:32:01.1544967Z contiguous=True, 2025-05-07T20:32:01.1545196Z compiled=False, 2025-05-07T20:32:01.1545407Z ) 2025-05-07T20:32:01.1545888Z self = 2025-05-07T20:32:01.1546386Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.1546665Z 2025-05-07T20:32:01.1546743Z @given( 2025-05-07T20:32:01.1546972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1547280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1547587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1547918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1548242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1548529Z ) 2025-05-07T20:32:01.1548880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1549318Z def test_silu_mul_quant( 2025-05-07T20:32:01.1549565Z self, 2025-05-07T20:32:01.1549765Z T: int, 2025-05-07T20:32:01.1549958Z D: int, 2025-05-07T20:32:01.1550265Z scale_ub: Optional[float], 2025-05-07T20:32:01.1550544Z contiguous: bool, 2025-05-07T20:32:01.1550775Z compiled: bool, 2025-05-07T20:32:01.1551001Z ) -> None: 2025-05-07T20:32:01.1551219Z torch.manual_seed(2025) 2025-05-07T20:32:01.1551466Z 2025-05-07T20:32:01.1551734Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1553771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1555622Z 2025-05-07T20:32:01.1555821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1556036Z 2025-05-07T20:32:01.1556148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1556556Z self=, 2025-05-07T20:32:01.1556964Z T=2048, 2025-05-07T20:32:01.1557152Z D=5120, 2025-05-07T20:32:01.1557348Z scale_ub=None, 2025-05-07T20:32:01.1557557Z contiguous=False, 2025-05-07T20:32:01.1557785Z compiled=False, 2025-05-07T20:32:01.1557989Z ) 2025-05-07T20:32:01.1558302Z self = 2025-05-07T20:32:01.1558798Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.1559069Z 2025-05-07T20:32:01.1559155Z @given( 2025-05-07T20:32:01.1559377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1559688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1559999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1560327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1560669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1560964Z ) 2025-05-07T20:32:01.1561308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1561754Z def test_silu_mul_quant( 2025-05-07T20:32:01.1562004Z self, 2025-05-07T20:32:01.1562200Z T: int, 2025-05-07T20:32:01.1562405Z D: int, 2025-05-07T20:32:01.1562625Z scale_ub: Optional[float], 2025-05-07T20:32:01.1562890Z contiguous: bool, 2025-05-07T20:32:01.1563135Z compiled: bool, 2025-05-07T20:32:01.1563462Z ) -> None: 2025-05-07T20:32:01.1563673Z torch.manual_seed(2025) 2025-05-07T20:32:01.1563918Z 2025-05-07T20:32:01.1564199Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1566350Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1568201Z 2025-05-07T20:32:01.1568328Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1568538Z 2025-05-07T20:32:01.1568639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1569055Z self=, 2025-05-07T20:32:01.1569458Z T=4096, 2025-05-07T20:32:01.1569640Z D=7168, 2025-05-07T20:32:01.1569829Z scale_ub=None, 2025-05-07T20:32:01.1570044Z contiguous=True, 2025-05-07T20:32:01.1570321Z compiled=True, 2025-05-07T20:32:01.1570522Z ) 2025-05-07T20:32:01.1570841Z self = 2025-05-07T20:32:01.1571332Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.1571599Z 2025-05-07T20:32:01.1571676Z @given( 2025-05-07T20:32:01.1571906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1572215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1572516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1572845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1573175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1573458Z ) 2025-05-07T20:32:01.1573809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1574252Z def test_silu_mul_quant( 2025-05-07T20:32:01.1574502Z self, 2025-05-07T20:32:01.1574699Z T: int, 2025-05-07T20:32:01.1574947Z D: int, 2025-05-07T20:32:01.1575169Z scale_ub: Optional[float], 2025-05-07T20:32:01.1575436Z contiguous: bool, 2025-05-07T20:32:01.1575677Z compiled: bool, 2025-05-07T20:32:01.1575908Z ) -> None: 2025-05-07T20:32:01.1576119Z torch.manual_seed(2025) 2025-05-07T20:32:01.1576364Z 2025-05-07T20:32:01.1576642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1578680Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
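By this point free memory is down to roughly 26 MiB, so every new example dies at its first allocation and the log repeats the same trace with different (T, D). Instrumentation along these lines (torch.cuda.memory_allocated and torch.cuda.memory_reserved are real APIs; the placement is hypothetical) would confirm whether allocated memory ratchets upward across examples:

    import torch

    def log_cuda_mem(tag: str) -> None:
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")

    # e.g. log_cuda_mem(f"T={T} D={D}") at the top of test_silu_mul_quant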
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1580526Z 2025-05-07T20:32:01.1580671Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1580883Z 2025-05-07T20:32:01.1580985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1581402Z self=, 2025-05-07T20:32:01.1581804Z T=2048, 2025-05-07T20:32:01.1581989Z D=5120, 2025-05-07T20:32:01.1582183Z scale_ub=1200.0, 2025-05-07T20:32:01.1582409Z contiguous=False, 2025-05-07T20:32:01.1582638Z compiled=False, 2025-05-07T20:32:01.1582837Z ) 2025-05-07T20:32:01.1583158Z self = 2025-05-07T20:32:01.1583659Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.1583975Z 2025-05-07T20:32:01.1584063Z @given( 2025-05-07T20:32:01.1584297Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1584695Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1585002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1585337Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1585668Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1585955Z ) 2025-05-07T20:32:01.1586304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1586748Z def test_silu_mul_quant( 2025-05-07T20:32:01.1586988Z self, 2025-05-07T20:32:01.1587179Z T: int, 2025-05-07T20:32:01.1587379Z D: int, 2025-05-07T20:32:01.1587601Z scale_ub: Optional[float], 2025-05-07T20:32:01.1587868Z contiguous: bool, 2025-05-07T20:32:01.1588108Z compiled: bool, 2025-05-07T20:32:01.1588332Z ) -> None: 2025-05-07T20:32:01.1588544Z torch.manual_seed(2025) 2025-05-07T20:32:01.1588788Z 2025-05-07T20:32:01.1589061Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1591156Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1593007Z 2025-05-07T20:32:01.1593130Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1593341Z 2025-05-07T20:32:01.1593445Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1593858Z self=, 2025-05-07T20:32:01.1594259Z T=4096, 2025-05-07T20:32:01.1594444Z D=7168, 2025-05-07T20:32:01.1594644Z scale_ub=1200.0, 2025-05-07T20:32:01.1594911Z contiguous=True, 2025-05-07T20:32:01.1595128Z compiled=False, 2025-05-07T20:32:01.1595336Z ) 2025-05-07T20:32:01.2518290Z self = 2025-05-07T20:32:01.2519042Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2519524Z 2025-05-07T20:32:01.2519653Z @given( 2025-05-07T20:32:01.2519950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2520267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2520576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2520906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2521230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2521523Z ) 2025-05-07T20:32:01.2521877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2522318Z def test_silu_mul_quant( 2025-05-07T20:32:01.2522607Z self, 2025-05-07T20:32:01.2522809Z T: int, 2025-05-07T20:32:01.2523004Z D: int, 2025-05-07T20:32:01.2523335Z scale_ub: Optional[float], 2025-05-07T20:32:01.2523616Z contiguous: bool, 2025-05-07T20:32:01.2523895Z compiled: bool, 2025-05-07T20:32:01.2524122Z ) -> None: 2025-05-07T20:32:01.2524344Z torch.manual_seed(2025) 2025-05-07T20:32:01.2524585Z 2025-05-07T20:32:01.2524862Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2527243Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2529127Z 2025-05-07T20:32:01.2529248Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2529459Z 2025-05-07T20:32:01.2529569Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2529977Z self=, 2025-05-07T20:32:01.2530386Z T=16384, 2025-05-07T20:32:01.2530584Z D=7168, 2025-05-07T20:32:01.2530772Z scale_ub=None, 2025-05-07T20:32:01.2530988Z contiguous=False, 2025-05-07T20:32:01.2531215Z compiled=True, 2025-05-07T20:32:01.2531415Z ) 2025-05-07T20:32:01.2531736Z self = 2025-05-07T20:32:01.2532233Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.2532510Z 2025-05-07T20:32:01.2532594Z @given( 2025-05-07T20:32:01.2532824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2533231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2533537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2533860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2534186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2534476Z ) 2025-05-07T20:32:01.2534819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2535263Z def test_silu_mul_quant( 2025-05-07T20:32:01.2535504Z self, 2025-05-07T20:32:01.2535702Z T: int, 2025-05-07T20:32:01.2535897Z D: int, 2025-05-07T20:32:01.2536117Z scale_ub: Optional[float], 2025-05-07T20:32:01.2536389Z contiguous: bool, 2025-05-07T20:32:01.2536627Z compiled: bool, 2025-05-07T20:32:01.2536854Z ) -> None: 2025-05-07T20:32:01.2537075Z torch.manual_seed(2025) 2025-05-07T20:32:01.2537315Z 2025-05-07T20:32:01.2537597Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2540014Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
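The failing line number also drifts as free memory shrinks. Matching the carets in these tracebacks against the test body (inferred from this log, not from the source file):

    # moe/activation_test.py:92 -> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # moe/activation_test.py:94 -> x_sign = torch.sign(x)
    # moe/activation_test.py:95 -> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)

Earlier examples got past randn and failed on the temporaries; by now even the initial randn fails.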
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2541867Z 2025-05-07T20:32:01.2541996Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2542209Z 2025-05-07T20:32:01.2542322Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2542728Z self=, 2025-05-07T20:32:01.2543141Z T=4096, 2025-05-07T20:32:01.2543337Z D=7168, 2025-05-07T20:32:01.2543522Z scale_ub=None, 2025-05-07T20:32:01.2543740Z contiguous=True, 2025-05-07T20:32:01.2543970Z compiled=False, 2025-05-07T20:32:01.2544167Z ) 2025-05-07T20:32:01.2544487Z self = 2025-05-07T20:32:01.2544984Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.2545253Z 2025-05-07T20:32:01.2545333Z @given( 2025-05-07T20:32:01.2545563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2545879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2546185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2546507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2546835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2547121Z ) 2025-05-07T20:32:01.2547597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2548054Z def test_silu_mul_quant( 2025-05-07T20:32:01.2548302Z self, 2025-05-07T20:32:01.2548495Z T: int, 2025-05-07T20:32:01.2548699Z D: int, 2025-05-07T20:32:01.2548920Z scale_ub: Optional[float], 2025-05-07T20:32:01.2549193Z contiguous: bool, 2025-05-07T20:32:01.2549433Z compiled: bool, 2025-05-07T20:32:01.2549658Z ) -> None: 2025-05-07T20:32:01.2549873Z torch.manual_seed(2025) 2025-05-07T20:32:01.2550116Z 2025-05-07T20:32:01.2550394Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2552440Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2554405Z 2025-05-07T20:32:01.2554529Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2554741Z 2025-05-07T20:32:01.2554847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2555262Z self=, 2025-05-07T20:32:01.2555670Z T=16384, 2025-05-07T20:32:01.2555860Z D=7168, 2025-05-07T20:32:01.2556060Z scale_ub=None, 2025-05-07T20:32:01.2556279Z contiguous=True, 2025-05-07T20:32:01.2556502Z compiled=False, 2025-05-07T20:32:01.2556709Z ) 2025-05-07T20:32:01.2557032Z self = 2025-05-07T20:32:01.2557529Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.2557808Z 2025-05-07T20:32:01.2557961Z @given( 2025-05-07T20:32:01.2558195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2558511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2558813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2559143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2559491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2559773Z ) 2025-05-07T20:32:01.2560128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2560575Z def test_silu_mul_quant( 2025-05-07T20:32:01.2560821Z self, 2025-05-07T20:32:01.2561015Z T: int, 2025-05-07T20:32:01.2561221Z D: int, 2025-05-07T20:32:01.2561445Z scale_ub: Optional[float], 2025-05-07T20:32:01.2561716Z contiguous: bool, 2025-05-07T20:32:01.2561960Z compiled: bool, 2025-05-07T20:32:01.2562196Z ) -> None: 2025-05-07T20:32:01.2562419Z torch.manual_seed(2025) 2025-05-07T20:32:01.2562673Z 2025-05-07T20:32:01.2562953Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2565158Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2567016Z 2025-05-07T20:32:01.2567136Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2567357Z 2025-05-07T20:32:01.2567463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2568007Z self=, 2025-05-07T20:32:01.2568428Z T=16384, 2025-05-07T20:32:01.2568620Z D=7168, 2025-05-07T20:32:01.2568818Z scale_ub=1200.0, 2025-05-07T20:32:01.2580374Z contiguous=True, 2025-05-07T20:32:01.2580646Z compiled=False, 2025-05-07T20:32:01.2580843Z ) 2025-05-07T20:32:01.2581152Z self = 2025-05-07T20:32:01.2581645Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2581925Z 2025-05-07T20:32:01.2582004Z @given( 2025-05-07T20:32:01.2582226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2582529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2582824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2583141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2583453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2583815Z ) 2025-05-07T20:32:01.2584162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2584589Z def test_silu_mul_quant( 2025-05-07T20:32:01.2584819Z self, 2025-05-07T20:32:01.2585005Z T: int, 2025-05-07T20:32:01.2585190Z D: int, 2025-05-07T20:32:01.2585400Z scale_ub: Optional[float], 2025-05-07T20:32:01.2585661Z contiguous: bool, 2025-05-07T20:32:01.2585887Z compiled: bool, 2025-05-07T20:32:01.2586100Z ) -> None: 2025-05-07T20:32:01.2586316Z torch.manual_seed(2025) 2025-05-07T20:32:01.2586555Z 2025-05-07T20:32:01.2586834Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2588890Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf52b7e0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
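This CompilationError is independent of the out-of-memory failures: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this GPU and offers only fp8e4b15 and fp8e5. In recent Triton releases fp8e4nv generally requires an NVIDIA GPU of compute capability 8.9 or newer, so on older parts the realistic options are to skip or fall back. A minimal sketch of a capability gate, assuming the (8, 9) threshold holds for the Triton version in use:

    # Hedged sketch: skip FP8 E4M3 tests on GPUs where Triton cannot compile them.
    # The (8, 9) threshold is an assumption based on fp8e4nv commonly requiring
    # sm_89+ (Ada/Hopper); verify it against the installed Triton.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical container for the fp8 cases
        ...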
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (test source identical to the listing above)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT/compile frames identical to those shown above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    (test source as above)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
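Note the trend across examples: free memory has fallen from 26.44 MiB to 4.44 MiB, and now even a 20 MiB allocation three lines into the test fails, so tensors from earlier failed examples are apparently still holding the pool. One possible mitigation, sketched under the assumption that the pressure comes from ordinary live and cached tensors rather than a leak inside the kernels (hypothetical helper, not from the test file):

    # Hedged sketch: release CUDA memory between examples.
    import gc

    import torch


    def free_cuda_memory() -> None:
        gc.collect()              # drop Python references left over from a failed example
        torch.cuda.synchronize()  # let pending kernels finish first
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver

Because Hypothesis runs many examples inside a single test invocation, unittest's tearDown only fires after all of them; a helper like this would have to be called at the end of the test body (e.g. in a finally block) to take effect per example.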
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True

    (test source as above)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
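Every OOM message above ends with the allocator's own hint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That option is documented on the Memory Management page the log links to; whether it rescues this job depends on how much of the problem is fragmentation rather than genuinely exhausted memory. A minimal sketch of wiring it in (the variable must be set before CUDA is first initialized):

    # Hedged sketch: opt in to expandable segments before the first CUDA allocation.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported only after the allocator config is in place

    x = torch.zeros(1, device="cuda")  # first allocation now uses expandable segments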
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in <dictcomp>
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |     ^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    (test source as above, continuing past fn())
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(autotuner and Triton compile frames as in sub-exception 4 above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
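Hypothesis prints a ready-made replay handle for each distinct failure above. A schematic of how the decorator is meant to be used (the blob must match the strategies of the test it decorates, so it goes on test_silu_mul_quant itself, with the body unchanged, and is removed after debugging):

    # Hedged sketch: replay sub-exception 4 exactly, per Hypothesis's own suggestion.
    from hypothesis import given, reproduce_failure
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # blob copied from the log above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # existing test body, unchanged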
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    (test source as above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT/compile frames as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above, continuing past fn())
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(autotuner and Triton compile frames as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    (test source as above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT/compile frames as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.8856202Z 2025-05-07T20:32:01.8856629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.8857148Z 2025-05-07T20:32:01.8857262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.8857671Z self=, 2025-05-07T20:32:01.8858077Z T=1, 2025-05-07T20:32:01.8858267Z D=7168, 2025-05-07T20:32:01.8858452Z scale_ub=None, 2025-05-07T20:32:01.8858664Z contiguous=True, 2025-05-07T20:32:01.8858887Z compiled=True, 2025-05-07T20:32:01.8859084Z ) 2025-05-07T20:32:01.8859402Z self = 2025-05-07T20:32:01.8859884Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.8860289Z 2025-05-07T20:32:01.8860375Z @given( 2025-05-07T20:32:01.8860606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.8860918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.8861217Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.8861547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.8861875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.8862163Z ) 2025-05-07T20:32:01.8862505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.8862947Z def test_silu_mul_quant( 2025-05-07T20:32:01.8863185Z self, 2025-05-07T20:32:01.8863373Z T: int, 2025-05-07T20:32:01.8863572Z D: int, 2025-05-07T20:32:01.8863789Z scale_ub: Optional[float], 2025-05-07T20:32:01.8864053Z contiguous: bool, 2025-05-07T20:32:01.8864292Z compiled: bool, 2025-05-07T20:32:01.8864512Z ) -> None: 2025-05-07T20:32:01.8864777Z torch.manual_seed(2025) 2025-05-07T20:32:01.8865017Z 2025-05-07T20:32:01.8865297Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.8865635Z 2025-05-07T20:32:01.8865829Z x_sign = torch.sign(x) 2025-05-07T20:32:01.8866118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.8866428Z x = x_sign * x_clamp 2025-05-07T20:32:01.8866660Z x0 = x[:, :D] 2025-05-07T20:32:01.8866876Z x1 = x[:, D:] 2025-05-07T20:32:01.8867084Z 2025-05-07T20:32:01.8867264Z if contiguous: 2025-05-07T20:32:01.8867497Z x0 = x0.contiguous() 2025-05-07T20:32:01.8867758Z x1 = x1.contiguous() 2025-05-07T20:32:01.8867991Z 2025-05-07T20:32:01.8868183Z if scale_ub is not None: 2025-05-07T20:32:01.8868459Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.8868789Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.8869101Z ) 2025-05-07T20:32:01.8869347Z else: 2025-05-07T20:32:01.8869548Z scale_ub_tensor = None 2025-05-07T20:32:01.8869795Z 2025-05-07T20:32:01.8870032Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8870338Z op = silu_mul_quant 2025-05-07T20:32:01.8870589Z if compiled: 2025-05-07T20:32:01.8870839Z op = torch.compile(op) 2025-05-07T20:32:01.8871365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8871637Z 2025-05-07T20:32:01.8871828Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.8872111Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.8872396Z 2025-05-07T20:32:01.8872631Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8872967Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.8873253Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.8873576Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.8873951Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8874255Z 2025-05-07T20:32:01.8874456Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.8874650Z 2025-05-07T20:32:01.8874757Z moe/activation_test.py:126: 2025-05-07T20:32:01.8875044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8875378Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.8875704Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8876495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.8877257Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.8877810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.8878589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.8879286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.8880014Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.8880773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.8881530Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.8882264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.8883043Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.8883934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.8884476Z fn() 2025-05-07T20:32:01.8885088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.8885745Z self.fn.run( 2025-05-07T20:32:01.8886220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.8886751Z kernel = self.compile( 2025-05-07T20:32:01.8887300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.8887968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.8888365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8888593Z 2025-05-07T20:32:01.8888802Z self = 2025-05-07T20:32:01.8889885Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.8891309Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd397055440>} 2025-05-07T20:32:01.8892652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.8893675Z context = 2025-05-07T20:32:01.8894136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.8894658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.8894770Z module_map=module_map) 2025-05-07T20:32:01.8894943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.8895050Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.8895126Z E ^ 2025-05-07T20:32:01.8895490Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.8895913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Every remaining Hypothesis example fails with this same CompilationError, differing only in the generated parameters and in which Triton kernel is compiled first: _fbgemm_silu_mul_quant when fn() calls the FBGEMM silu_mul_quant op, or _kernel_quantize_fp8_row when ref_fn() reaches triton_quantize_fp8_row. The next four examples:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
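The repeated ValueError has a single root cause: fp8e4nv is Triton's name for the float8_e4m3fn format, and on this Triton build the NVIDIA backend only accepts it on GPUs of compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G at compute capability 8.6, where only fp8e4b15 and fp8e5 are available, which is exactly what the message reports. A minimal probe for this condition, assuming only the standard torch CUDA API (the helper name is hypothetical, not an FBGEMM function):

```python
import torch

def gpu_supports_fp8e4nv() -> bool:
    """Best-effort check: can Triton's CUDA backend emit fp8e4nv here?

    Assumption: fp8e4nv (float8_e4m3fn) needs compute capability >= (8, 9);
    the A10G in a g5.4xlarge reports (8, 6).
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

On this runner the probe would return False, which is consistent with every generated example failing the same way.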
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
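Rather than letting every generated example die inside the Triton compiler, the test could be skipped up front on such GPUs. A sketch of one way to wire that in, reusing the probe above; the class and method names are hypothetical, and the hypothesis decorators from the log are elided for brevity:

```python
import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv requires compute capability >= (8, 9) on this build.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTestGuarded(unittest.TestCase):
    @unittest.skipUnless(
        gpu_supports_fp8e4nv(),
        "Triton on this GPU only supports ('fp8e4b15', 'fp8e5'), not fp8e4nv",
    )
    def test_silu_mul_quant_guarded(self) -> None:
        # On the real test, the @given/@settings stack would sit between
        # skipUnless (outermost) and the function definition.
        pass
```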
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
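For the reference path specifically, the row-wise quantization that ref_fn() requests from triton_quantize_fp8_row can be approximated in plain PyTorch, which avoids Triton codegen entirely. A rough sketch, not the FBGEMM implementation: it assumes the usual float8_e4m3fn max-normal value of 448, an arbitrary epsilon, and the dequantization convention the test uses (y is approximately y_fp8.to(torch.float32) * y_scale[:, None]):

```python
from typing import Optional, Tuple

import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs scale so each row fits the float8_e4m3fn range.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Cap the per-row max, mirroring scale_ub_tensor in the test.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    y_scale = row_max.clamp(min=1e-12) / 448.0  # 448 = e4m3fn max normal
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```

This needs a PyTorch build that exposes torch.float8_e4m3fn (2.1 or later), and its numerics will differ slightly from the Triton kernel's.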
2025-05-07T20:32:01.8996621Z op = torch.compile(op) 2025-05-07T20:32:01.8996729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8996801Z 2025-05-07T20:32:01.8996898Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.8997019Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.8997092Z 2025-05-07T20:32:01.8997239Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8997341Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.8997448Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.8997570Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.8997709Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8997790Z 2025-05-07T20:32:01.8997889Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.8997894Z 2025-05-07T20:32:01.8997992Z moe/activation_test.py:126: 2025-05-07T20:32:01.8998135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8998286Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.8998427Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8998993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.8999095Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.8999466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.8999690Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9000063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9000326Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9000738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.9001041Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9001420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9001589Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9001941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9002018Z fn() 2025-05-07T20:32:01.9002432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9002515Z self.fn.run( 2025-05-07T20:32:01.9002857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9002962Z kernel = self.compile( 2025-05-07T20:32:01.9003516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9003693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9003826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9003831Z 2025-05-07T20:32:01.9004038Z self = 2025-05-07T20:32:01.9004818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:01.9005320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3969c2f20>} 2025-05-07T20:32:01.9006162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9006362Z context = 2025-05-07T20:32:01.9006367Z 2025-05-07T20:32:01.9006533Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9006805Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9006919Z module_map=module_map) 2025-05-07T20:32:01.9007089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9007190Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.9007266Z E ^ 2025-05-07T20:32:01.9007628Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9007633Z 2025-05-07T20:32:01.9008097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9008103Z 2025-05-07T20:32:01.9008210Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9008439Z self=, 2025-05-07T20:32:01.9008517Z T=128, 2025-05-07T20:32:01.9008595Z D=5120, 2025-05-07T20:32:01.9008676Z scale_ub=None, 2025-05-07T20:32:01.9008760Z contiguous=True, 2025-05-07T20:32:01.9008851Z compiled=True, 2025-05-07T20:32:01.9008919Z ) 2025-05-07T20:32:01.9009143Z self = 2025-05-07T20:32:01.9009318Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.9009324Z 2025-05-07T20:32:01.9009401Z @given( 2025-05-07T20:32:01.9009519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9009622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9009879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9010004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9010117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9010194Z ) 2025-05-07T20:32:01.9010444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9010536Z def test_silu_mul_quant( 2025-05-07T20:32:01.9010613Z self, 2025-05-07T20:32:01.9010696Z T: int, 2025-05-07T20:32:01.9010771Z D: int, 2025-05-07T20:32:01.9010869Z scale_ub: Optional[float], 2025-05-07T20:32:01.9010966Z contiguous: bool, 2025-05-07T20:32:01.9011053Z compiled: bool, 2025-05-07T20:32:01.9011130Z ) -> None: 2025-05-07T20:32:01.9011233Z torch.manual_seed(2025) 2025-05-07T20:32:01.9011307Z 2025-05-07T20:32:01.9011488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9011564Z 2025-05-07T20:32:01.9011663Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9011793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9011882Z x = x_sign * x_clamp 2025-05-07T20:32:01.9011962Z x0 = x[:, :D] 2025-05-07T20:32:01.9012048Z x1 = x[:, D:] 2025-05-07T20:32:01.9012120Z 2025-05-07T20:32:01.9012202Z if contiguous: 2025-05-07T20:32:01.9012301Z x0 = x0.contiguous() 2025-05-07T20:32:01.9012389Z x1 = x1.contiguous() 2025-05-07T20:32:01.9012461Z 2025-05-07T20:32:01.9012561Z if scale_ub is not None: 2025-05-07T20:32:01.9012668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9012811Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9012884Z ) 2025-05-07T20:32:01.9012960Z else: 2025-05-07T20:32:01.9013060Z scale_ub_tensor = None 2025-05-07T20:32:01.9013133Z 2025-05-07T20:32:01.9013344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:01.9013450Z op = silu_mul_quant 2025-05-07T20:32:01.9013536Z if compiled: 2025-05-07T20:32:01.9013641Z op = torch.compile(op) 2025-05-07T20:32:01.9013754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9013827Z 2025-05-07T20:32:01.9013918Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.9014045Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.9014119Z 2025-05-07T20:32:01.9014255Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9014362Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.9014460Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.9014587Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.9014727Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9014799Z 2025-05-07T20:32:01.9014909Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.9014966Z 2025-05-07T20:32:01.9015069Z moe/activation_test.py:126: 2025-05-07T20:32:01.9015199Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9015311Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.9015443Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9016018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.9016117Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.9016486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9016716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9017088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9017356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9017807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.9018062Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9018448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9018615Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9018960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9019045Z fn() 2025-05-07T20:32:01.9019449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9019535Z self.fn.run( 2025-05-07T20:32:01.9019884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9019981Z kernel = self.compile( 2025-05-07T20:32:01.9020374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9020552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9020686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9020691Z 2025-05-07T20:32:01.9020905Z self = 2025-05-07T20:32:01.9021683Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd396163c40>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd301b49760>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
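Every failure in this run reduces to the one ValueError above: fp8e4nv is Triton's name for the float8_e4m3fn encoding, and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). A device whose backend offers only 'fp8e4b15' and 'fp8e5' is a pre-8.9 part, so both fp8 kernels in this test are unbuildable here regardless of the drawn parameters. A minimal sketch to confirm what the device reports (the values in the comments are what an SM 8.6 part such as an A10G would print, not values taken from this log):

import torch

# Report the GPU the tests ran on. Triton refuses fp8e4nv below (8, 9),
# which matches the "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" message.
print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA A10G"
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6)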
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
This example fails identically to the one above: ref_fn() reaches _kernel_quantize_fp8_row via triton_quantize_fp8_row, and compilation raises the same CompilationError from moe/activation_test.py:126.

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

This example fails one step earlier, at the call under test rather than in the reference path:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd301c96ac0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
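Both kernels fail for the same reason: _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant each ask Triton to emit an e4m3 (fp8e4nv) conversion. Where only fp8e5 (e5m2) is available, one workaround pattern is to choose the fp8 dtype at runtime; a minimal sketch, where pick_fp8_dtype is a hypothetical helper and not anything FBGEMM ships:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # torch.float8_e4m3fn lowers to Triton's fp8e4nv, which needs SM 8.9+;
    # torch.float8_e5m2 lowers to fp8e5, which this log lists as supported.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2

Note that e5m2 trades a mantissa bit for range, so a test comparing against an e4m3 reference would also need looser tolerances.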
Hypothesis keeps drawing examples, and each fails with the same CompilationError; only the drawn parameters and the first kernel to hit the Triton compiler differ:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
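Every drawn example, whatever the values of T, D, scale_ub, contiguous, or compiled, dies in the same place, which marks this as an environment mismatch rather than a property of the inputs. A capability-based skip would turn these cases into clean skips on such runners; a sketch of one possible guard (hypothetical; nothing in moe/activation_test.py is shown configuring this):

import pytest
import torch

# Skip fp8 tests on GPUs whose Triton backend cannot compile fp8e4nv (e4m3).
requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv requires compute capability 8.9+ (Ada/Hopper)",
)

Applied as @requires_fp8e4nv on test_silu_mul_quant, the guard would report these cases as skipped instead of failing the job.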
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9122159Z 2025-05-07T20:32:01.9122585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9122633Z 2025-05-07T20:32:01.9122747Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9122972Z self=, 2025-05-07T20:32:01.9123056Z T=128, 2025-05-07T20:32:01.9123133Z D=7168, 2025-05-07T20:32:01.9123320Z scale_ub=1200.0, 2025-05-07T20:32:01.9123406Z contiguous=False, 2025-05-07T20:32:01.9123486Z compiled=False, 2025-05-07T20:32:01.9123569Z ) 2025-05-07T20:32:01.9123787Z self = 2025-05-07T20:32:01.9123960Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9123965Z 2025-05-07T20:32:01.9124046Z @given( 2025-05-07T20:32:01.9124166Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9124263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9124386Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9124502Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9124682Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9124757Z ) 2025-05-07T20:32:01.9125001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9125101Z def test_silu_mul_quant( 2025-05-07T20:32:01.9125176Z self, 2025-05-07T20:32:01.9125249Z T: int, 2025-05-07T20:32:01.9125331Z D: int, 2025-05-07T20:32:01.9125428Z scale_ub: Optional[float], 2025-05-07T20:32:01.9125516Z contiguous: bool, 2025-05-07T20:32:01.9125608Z compiled: bool, 2025-05-07T20:32:01.9125687Z ) -> None: 2025-05-07T20:32:01.9125778Z torch.manual_seed(2025) 2025-05-07T20:32:01.9125856Z 2025-05-07T20:32:01.9126026Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9126104Z 2025-05-07T20:32:01.9126195Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9126321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9126428Z x = x_sign * x_clamp 2025-05-07T20:32:01.9126512Z x0 = x[:, :D] 2025-05-07T20:32:01.9126591Z x1 = x[:, D:] 2025-05-07T20:32:01.9126672Z 2025-05-07T20:32:01.9126755Z if contiguous: 2025-05-07T20:32:01.9126847Z x0 = x0.contiguous() 2025-05-07T20:32:01.9126942Z x1 = x1.contiguous() 2025-05-07T20:32:01.9127012Z 2025-05-07T20:32:01.9127102Z if scale_ub is not None: 2025-05-07T20:32:01.9127213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9127347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9127429Z ) 2025-05-07T20:32:01.9127503Z else: 2025-05-07T20:32:01.9127599Z scale_ub_tensor = None 2025-05-07T20:32:01.9127676Z 2025-05-07T20:32:01.9127806Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9127897Z op = silu_mul_quant 2025-05-07T20:32:01.9127990Z if compiled: 2025-05-07T20:32:01.9128198Z op = torch.compile(op) 2025-05-07T20:32:01.9128313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9128395Z 2025-05-07T20:32:01.9128487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9128492Z 2025-05-07T20:32:01.9128593Z moe/activation_test.py:117: 2025-05-07T20:32:01.9128729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9128829Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9128933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9129438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9129536Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9129911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9130144Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9130533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9130635Z kernel = self.compile( 2025-05-07T20:32:01.9131023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9131207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9131333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9131338Z 2025-05-07T20:32:01.9131548Z self = 2025-05-07T20:32:01.9132336Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9132842Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3009f23e0>} 2025-05-07T20:32:01.9133649Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9133842Z context = 2025-05-07T20:32:01.9133846Z 2025-05-07T20:32:01.9134018Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9134282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9134389Z module_map=module_map) 2025-05-07T20:32:01.9134559Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9134656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9134731Z E ^ 2025-05-07T20:32:01.9135103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9135111Z 2025-05-07T20:32:01.9135529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9135534Z 2025-05-07T20:32:01.9135644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9135865Z self=, 2025-05-07T20:32:01.9135945Z T=128, 2025-05-07T20:32:01.9136024Z D=5120, 2025-05-07T20:32:01.9136108Z scale_ub=None, 2025-05-07T20:32:01.9136194Z contiguous=False, 2025-05-07T20:32:01.9136283Z compiled=False, 2025-05-07T20:32:01.9136359Z ) 2025-05-07T20:32:01.9136574Z self = 2025-05-07T20:32:01.9136753Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9136760Z 2025-05-07T20:32:01.9136919Z @given( 2025-05-07T20:32:01.9137046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9137145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9137257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9137383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9137494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9137569Z ) 2025-05-07T20:32:01.9137817Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9137910Z def test_silu_mul_quant( 2025-05-07T20:32:01.9137992Z self, 2025-05-07T20:32:01.9138069Z T: int, 2025-05-07T20:32:01.9138145Z D: int, 2025-05-07T20:32:01.9138246Z scale_ub: Optional[float], 2025-05-07T20:32:01.9138335Z contiguous: bool, 2025-05-07T20:32:01.9138673Z compiled: bool, 2025-05-07T20:32:01.9138800Z ) -> None: 2025-05-07T20:32:01.9138942Z torch.manual_seed(2025) 2025-05-07T20:32:01.9139167Z 2025-05-07T20:32:01.9139348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9139421Z 2025-05-07T20:32:01.9139510Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9139644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9139732Z x = x_sign * x_clamp 2025-05-07T20:32:01.9139813Z x0 = x[:, :D] 2025-05-07T20:32:01.9139902Z x1 = x[:, D:] 2025-05-07T20:32:01.9139972Z 2025-05-07T20:32:01.9140060Z if contiguous: 2025-05-07T20:32:01.9140149Z x0 = x0.contiguous() 2025-05-07T20:32:01.9140235Z x1 = x1.contiguous() 2025-05-07T20:32:01.9140313Z 2025-05-07T20:32:01.9140404Z if scale_ub is not None: 2025-05-07T20:32:01.9140513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9140650Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9140726Z ) 2025-05-07T20:32:01.9140812Z else: 2025-05-07T20:32:01.9140993Z scale_ub_tensor = None 2025-05-07T20:32:01.9141061Z 2025-05-07T20:32:01.9141190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9141290Z op = silu_mul_quant 2025-05-07T20:32:01.9141374Z if compiled: 2025-05-07T20:32:01.9141478Z op = torch.compile(op) 2025-05-07T20:32:01.9141587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9141659Z 2025-05-07T20:32:01.9141756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9141761Z 2025-05-07T20:32:01.9141860Z moe/activation_test.py:117: 2025-05-07T20:32:01.9141989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9142093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9142193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9142699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9142807Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9143171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9143401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9143743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9143838Z kernel = self.compile( 2025-05-07T20:32:01.9144235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9144413Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9144547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9144552Z 2025-05-07T20:32:01.9144756Z self = 2025-05-07T20:32:01.9145669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9146188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3011e77e0>} 2025-05-07T20:32:01.9146942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9147139Z context = 2025-05-07T20:32:01.9147144Z 2025-05-07T20:32:01.9147310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9147579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9147833Z module_map=module_map) 2025-05-07T20:32:01.9147994Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9148100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9148179Z E ^ 2025-05-07T20:32:01.9148536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9148541Z 2025-05-07T20:32:01.9148967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9148972Z 2025-05-07T20:32:01.9149076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9149304Z self=, 2025-05-07T20:32:01.9149384Z T=128, 2025-05-07T20:32:01.9149464Z D=5120, 2025-05-07T20:32:01.9149555Z scale_ub=1200.0, 2025-05-07T20:32:01.9149649Z contiguous=True, 2025-05-07T20:32:01.9149821Z compiled=False, 2025-05-07T20:32:01.9149901Z ) 2025-05-07T20:32:01.9150119Z self = 2025-05-07T20:32:01.9150290Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.9150295Z 2025-05-07T20:32:01.9150378Z @given( 2025-05-07T20:32:01.9150497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9150601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9150716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9150833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9150952Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9151026Z ) 2025-05-07T20:32:01.9151270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9151372Z def test_silu_mul_quant( 2025-05-07T20:32:01.9151448Z self, 2025-05-07T20:32:01.9151536Z T: int, 2025-05-07T20:32:01.9151621Z D: int, 2025-05-07T20:32:01.9151719Z scale_ub: Optional[float], 2025-05-07T20:32:01.9151810Z contiguous: bool, 2025-05-07T20:32:01.9151904Z compiled: bool, 2025-05-07T20:32:01.9151981Z ) -> None: 2025-05-07T20:32:01.9152081Z torch.manual_seed(2025) 2025-05-07T20:32:01.9152156Z 2025-05-07T20:32:01.9152331Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9152406Z 2025-05-07T20:32:01.9152515Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9152639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9152729Z x = x_sign * x_clamp 2025-05-07T20:32:01.9152815Z x0 = x[:, :D] 2025-05-07T20:32:01.9152895Z x1 = x[:, D:] 2025-05-07T20:32:01.9152978Z 2025-05-07T20:32:01.9153062Z if contiguous: 2025-05-07T20:32:01.9153152Z x0 = x0.contiguous() 2025-05-07T20:32:01.9153331Z x1 = x1.contiguous() 2025-05-07T20:32:01.9153408Z 2025-05-07T20:32:01.9153499Z if scale_ub is not None: 2025-05-07T20:32:01.9153614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9153750Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9153829Z ) 2025-05-07T20:32:01.9153914Z else: 2025-05-07T20:32:01.9154008Z scale_ub_tensor = None 2025-05-07T20:32:01.9154084Z 2025-05-07T20:32:01.9154222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9154312Z op = silu_mul_quant 2025-05-07T20:32:01.9154398Z if compiled: 2025-05-07T20:32:01.9154504Z op = torch.compile(op) 2025-05-07T20:32:01.9154610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9154691Z 2025-05-07T20:32:01.9154782Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9154786Z 2025-05-07T20:32:01.9154883Z moe/activation_test.py:117: 2025-05-07T20:32:01.9155067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9155169Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9155270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9155784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9155880Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9156252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9156480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9156824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9156923Z kernel = self.compile( 2025-05-07T20:32:01.9157315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9157536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9157670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9157674Z 2025-05-07T20:32:01.9157879Z self = 2025-05-07T20:32:01.9158666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9159170Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3011e51c0>} 2025-05-07T20:32:01.9159940Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9160135Z context = 2025-05-07T20:32:01.9160139Z 2025-05-07T20:32:01.9160306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9160580Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9160685Z module_map=module_map) 2025-05-07T20:32:01.9160857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9160954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9161032Z E ^ 2025-05-07T20:32:01.9161398Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9161402Z 2025-05-07T20:32:01.9161820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9161929Z 2025-05-07T20:32:01.9162035Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9162264Z self=, 2025-05-07T20:32:01.9162341Z T=1, 2025-05-07T20:32:01.9162426Z D=7168, 2025-05-07T20:32:01.9162510Z scale_ub=1200.0, 2025-05-07T20:32:01.9162593Z contiguous=True, 2025-05-07T20:32:01.9162685Z compiled=True, 2025-05-07T20:32:01.9162759Z ) 2025-05-07T20:32:01.9162977Z self = 2025-05-07T20:32:01.9163150Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9163155Z 2025-05-07T20:32:01.9163338Z @given( 2025-05-07T20:32:01.9163457Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9163561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9163676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9163811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9163972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9164046Z ) 2025-05-07T20:32:01.9164297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9164392Z def test_silu_mul_quant( 2025-05-07T20:32:01.9164471Z self, 2025-05-07T20:32:01.9164552Z T: int, 2025-05-07T20:32:01.9164630Z D: int, 2025-05-07T20:32:01.9164727Z scale_ub: Optional[float], 2025-05-07T20:32:01.9164822Z contiguous: bool, 2025-05-07T20:32:01.9164910Z compiled: bool, 2025-05-07T20:32:01.9164989Z ) -> None: 2025-05-07T20:32:01.9165091Z torch.manual_seed(2025) 2025-05-07T20:32:01.9165168Z 2025-05-07T20:32:01.9165347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9165419Z 2025-05-07T20:32:01.9165511Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9165642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9165785Z x = x_sign * x_clamp 2025-05-07T20:32:01.9165867Z x0 = x[:, :D] 2025-05-07T20:32:01.9165957Z x1 = x[:, D:] 2025-05-07T20:32:01.9166029Z 2025-05-07T20:32:01.9166111Z if contiguous: 2025-05-07T20:32:01.9166215Z x0 = x0.contiguous() 2025-05-07T20:32:01.9166305Z x1 = x1.contiguous() 2025-05-07T20:32:01.9166379Z 2025-05-07T20:32:01.9166478Z if scale_ub is not None: 2025-05-07T20:32:01.9166584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9166724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9166801Z ) 2025-05-07T20:32:01.9166881Z else: 2025-05-07T20:32:01.9166983Z scale_ub_tensor = None 2025-05-07T20:32:01.9167057Z 2025-05-07T20:32:01.9167190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9167286Z op = silu_mul_quant 2025-05-07T20:32:01.9167373Z if compiled: 2025-05-07T20:32:01.9167479Z op = torch.compile(op) 2025-05-07T20:32:01.9167597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9167672Z 2025-05-07T20:32:01.9167766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9167770Z 2025-05-07T20:32:01.9167878Z moe/activation_test.py:117: 2025-05-07T20:32:01.9168009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9168116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9168213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9168588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9168689Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9169192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9169288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9169745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9169973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9170325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9170419Z kernel = self.compile( 2025-05-07T20:32:01.9170810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9170995Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9171122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9171127Z 2025-05-07T20:32:01.9171338Z self = 2025-05-07T20:32:01.9172121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9172669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3011e67a0>} 2025-05-07T20:32:01.9173435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9173626Z context = 2025-05-07T20:32:01.9173630Z 2025-05-07T20:32:01.9173802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9174065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9174173Z module_map=module_map) 2025-05-07T20:32:01.9174399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9174495Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9174582Z E ^ 2025-05-07T20:32:01.9174940Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [same @given/@settings decorators and test body as the first example; with
    scale_ub=None the failure moves past fn() into the reference path:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
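The scale_ub=None example shows the intended numerics in full: ref_fn computes x0 * sigmoid(x0) * x1 in fp32 and then rowwise-quantizes the product to FP8. A plain-PyTorch sketch of that rowwise quantization, useful for checking the comparison without Triton; the helper names, eps guard, and clamping details are assumptions, not FBGEMM's triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple

    import torch


    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1, computed in fp32 as in ref_fn above.
        x0_fp32 = x0.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)


    def quantize_fp8_rowwise(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).clamp_min(eps)    # per-row absmax, shape [T]
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the row scale
        y_scale = row_max / fp8_max                     # dequantization scale
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The test then dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], matching the return convention sketched here.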
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9326633Z 2025-05-07T20:32:01.9327044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9327049Z 2025-05-07T20:32:01.9327159Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9327384Z self=, 2025-05-07T20:32:01.9327502Z T=4096, 2025-05-07T20:32:01.9327582Z D=5120, 2025-05-07T20:32:01.9327667Z scale_ub=1200.0, 2025-05-07T20:32:01.9327751Z contiguous=False, 2025-05-07T20:32:01.9327839Z compiled=False, 2025-05-07T20:32:01.9327910Z ) 2025-05-07T20:32:01.9328133Z self = 2025-05-07T20:32:01.9328308Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9328312Z 2025-05-07T20:32:01.9328389Z @given( 2025-05-07T20:32:01.9328513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9328608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9328721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9328846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9328962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9329036Z ) 2025-05-07T20:32:01.9329291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9329427Z def test_silu_mul_quant( 2025-05-07T20:32:01.9329510Z self, 2025-05-07T20:32:01.9329586Z T: int, 2025-05-07T20:32:01.9329663Z D: int, 2025-05-07T20:32:01.9329766Z scale_ub: Optional[float], 2025-05-07T20:32:01.9329854Z contiguous: bool, 2025-05-07T20:32:01.9329940Z compiled: bool, 2025-05-07T20:32:01.9330023Z ) -> None: 2025-05-07T20:32:01.9330115Z torch.manual_seed(2025) 2025-05-07T20:32:01.9330187Z 2025-05-07T20:32:01.9330365Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9330439Z 2025-05-07T20:32:01.9330533Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9330662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9330748Z x = x_sign * x_clamp 2025-05-07T20:32:01.9330829Z x0 = x[:, :D] 2025-05-07T20:32:01.9330907Z x1 = x[:, D:] 2025-05-07T20:32:01.9330992Z 2025-05-07T20:32:01.9331079Z if contiguous: 2025-05-07T20:32:01.9331173Z x0 = x0.contiguous() 2025-05-07T20:32:01.9331261Z x1 = x1.contiguous() 2025-05-07T20:32:01.9331338Z 2025-05-07T20:32:01.9331426Z if scale_ub is not None: 2025-05-07T20:32:01.9331529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9331669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9331741Z ) 2025-05-07T20:32:01.9331819Z else: 2025-05-07T20:32:01.9331917Z scale_ub_tensor = None 2025-05-07T20:32:01.9331988Z 2025-05-07T20:32:01.9332115Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9332209Z op = silu_mul_quant 2025-05-07T20:32:01.9332294Z if compiled: 2025-05-07T20:32:01.9332397Z op = torch.compile(op) 2025-05-07T20:32:01.9332499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9332654Z 2025-05-07T20:32:01.9332753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9332757Z 2025-05-07T20:32:01.9332858Z moe/activation_test.py:117: 2025-05-07T20:32:01.9332986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9333086Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9333185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9333694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.9333788Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9334147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9334375Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9334718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9334880Z kernel = self.compile( 2025-05-07T20:32:01.9335272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9335444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9335578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9335583Z 2025-05-07T20:32:01.9335785Z self = 2025-05-07T20:32:01.9336555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9337059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b8400>} 2025-05-07T20:32:01.9337860Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9338052Z context = 2025-05-07T20:32:01.9338057Z 2025-05-07T20:32:01.9338221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9338685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9338847Z module_map=module_map) 2025-05-07T20:32:01.9339014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9339117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9339192Z E ^ 2025-05-07T20:32:01.9339555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9339576Z 2025-05-07T20:32:01.9339998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9340002Z 2025-05-07T20:32:01.9340103Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9340328Z self=, 2025-05-07T20:32:01.9340404Z T=4096, 2025-05-07T20:32:01.9340479Z D=5120, 2025-05-07T20:32:01.9340570Z scale_ub=1200.0, 2025-05-07T20:32:01.9340651Z contiguous=False, 2025-05-07T20:32:01.9340729Z compiled=True, 2025-05-07T20:32:01.9340805Z ) 2025-05-07T20:32:01.9341020Z self = 2025-05-07T20:32:01.9341193Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.9341198Z 2025-05-07T20:32:01.9341278Z @given( 2025-05-07T20:32:01.9341396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9341732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9341847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9341961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9342082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9342155Z ) 2025-05-07T20:32:01.9342400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9342496Z def test_silu_mul_quant( 2025-05-07T20:32:01.9342568Z self, 2025-05-07T20:32:01.9342644Z T: int, 2025-05-07T20:32:01.9342726Z D: int, 2025-05-07T20:32:01.9342822Z scale_ub: Optional[float], 2025-05-07T20:32:01.9342908Z contiguous: bool, 2025-05-07T20:32:01.9342997Z compiled: bool, 2025-05-07T20:32:01.9343073Z ) -> None: 2025-05-07T20:32:01.9343171Z torch.manual_seed(2025) 2025-05-07T20:32:01.9343241Z 2025-05-07T20:32:01.9343415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9343557Z 2025-05-07T20:32:01.9343647Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9343773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9343864Z x = x_sign * x_clamp 2025-05-07T20:32:01.9343940Z x0 = x[:, :D] 2025-05-07T20:32:01.9344018Z x1 = x[:, D:] 2025-05-07T20:32:01.9344092Z 2025-05-07T20:32:01.9344175Z if contiguous: 2025-05-07T20:32:01.9344264Z x0 = x0.contiguous() 2025-05-07T20:32:01.9344355Z x1 = x1.contiguous() 2025-05-07T20:32:01.9344427Z 2025-05-07T20:32:01.9344520Z if scale_ub is not None: 2025-05-07T20:32:01.9344621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9344752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9344832Z ) 2025-05-07T20:32:01.9344903Z else: 2025-05-07T20:32:01.9344998Z scale_ub_tensor = None 2025-05-07T20:32:01.9345080Z 2025-05-07T20:32:01.9345215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9345370Z op = silu_mul_quant 2025-05-07T20:32:01.9345461Z if compiled: 2025-05-07T20:32:01.9345558Z op = torch.compile(op) 2025-05-07T20:32:01.9345664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9345741Z 2025-05-07T20:32:01.9345831Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9345836Z 2025-05-07T20:32:01.9345938Z moe/activation_test.py:117: 2025-05-07T20:32:01.9346066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9346164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9346265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9346634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9346724Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9347229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9347328Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9347690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9347910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9348251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9348346Z kernel = self.compile( 2025-05-07T20:32:01.9348730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9348901Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9349036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9349040Z 2025-05-07T20:32:01.9349333Z self = 2025-05-07T20:32:01.9350111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9350608Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b9620>} 2025-05-07T20:32:01.9351365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9351551Z context = 2025-05-07T20:32:01.9351555Z 2025-05-07T20:32:01.9355567Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9355937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9356055Z module_map=module_map) 2025-05-07T20:32:01.9356217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9356315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9356394Z E ^ 2025-05-07T20:32:01.9356755Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9356761Z 2025-05-07T20:32:01.9357185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9357189Z 2025-05-07T20:32:01.9357289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9357512Z self=, 2025-05-07T20:32:01.9357595Z T=2048, 2025-05-07T20:32:01.9357674Z D=7168, 2025-05-07T20:32:01.9357762Z scale_ub=1200.0, 2025-05-07T20:32:01.9357899Z contiguous=False, 2025-05-07T20:32:01.9357980Z compiled=False, 2025-05-07T20:32:01.9358050Z ) 2025-05-07T20:32:01.9358271Z self = 2025-05-07T20:32:01.9358444Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9358449Z 2025-05-07T20:32:01.9358530Z @given( 2025-05-07T20:32:01.9358648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9358743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9358865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9358981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9359094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9359179Z ) 2025-05-07T20:32:01.9359422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9359521Z def test_silu_mul_quant( 2025-05-07T20:32:01.9359610Z self, 2025-05-07T20:32:01.9359689Z T: int, 2025-05-07T20:32:01.9359768Z D: int, 2025-05-07T20:32:01.9359865Z scale_ub: Optional[float], 2025-05-07T20:32:01.9359955Z contiguous: bool, 2025-05-07T20:32:01.9360044Z compiled: bool, 2025-05-07T20:32:01.9360122Z ) -> None: 2025-05-07T20:32:01.9360214Z torch.manual_seed(2025) 2025-05-07T20:32:01.9360291Z 2025-05-07T20:32:01.9360461Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9360535Z 2025-05-07T20:32:01.9360635Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9360756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9360845Z x = x_sign * x_clamp 2025-05-07T20:32:01.9360925Z x0 = x[:, :D] 2025-05-07T20:32:01.9361005Z x1 = x[:, D:] 2025-05-07T20:32:01.9361082Z 2025-05-07T20:32:01.9361162Z if contiguous: 2025-05-07T20:32:01.9361337Z x0 = x0.contiguous() 2025-05-07T20:32:01.9361438Z x1 = x1.contiguous() 2025-05-07T20:32:01.9361509Z 2025-05-07T20:32:01.9361595Z if scale_ub is not None: 2025-05-07T20:32:01.9361703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9361835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9361909Z ) 2025-05-07T20:32:01.9361986Z else: 2025-05-07T20:32:01.9362076Z scale_ub_tensor = None 2025-05-07T20:32:01.9362148Z 2025-05-07T20:32:01.9362278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9362367Z op = silu_mul_quant 2025-05-07T20:32:01.9362447Z if compiled: 2025-05-07T20:32:01.9362548Z op = torch.compile(op) 2025-05-07T20:32:01.9362655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9362732Z 2025-05-07T20:32:01.9362821Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9362825Z 2025-05-07T20:32:01.9362975Z moe/activation_test.py:117: 2025-05-07T20:32:01.9363110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9363303Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9363401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9363903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.9364001Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9364364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9364583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9364923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9365016Z kernel = self.compile( 2025-05-07T20:32:01.9365403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9365625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9365754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9365761Z 2025-05-07T20:32:01.9365962Z self = 2025-05-07T20:32:01.9366731Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9367228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000ba480>} 2025-05-07T20:32:01.9367987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9368179Z context = 2025-05-07T20:32:01.9368184Z 2025-05-07T20:32:01.9368345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9368614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9368716Z module_map=module_map) 2025-05-07T20:32:01.9368879Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9368973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9369045Z E ^ 2025-05-07T20:32:01.9369399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9369404Z 2025-05-07T20:32:01.9369918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9369928Z 2025-05-07T20:32:01.9370029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9370254Z self=, 2025-05-07T20:32:01.9370328Z T=1, 2025-05-07T20:32:01.9370406Z D=7168, 2025-05-07T20:32:01.9370486Z scale_ub=None, 2025-05-07T20:32:01.9370571Z contiguous=True, 2025-05-07T20:32:01.9370659Z compiled=False, 2025-05-07T20:32:01.9370730Z ) 2025-05-07T20:32:01.9370944Z self = 2025-05-07T20:32:01.9371112Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.9371116Z 2025-05-07T20:32:01.9371192Z @given( 2025-05-07T20:32:01.9371309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9371409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9371520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9371695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9371809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9371881Z ) 2025-05-07T20:32:01.9372131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9372220Z def test_silu_mul_quant( 2025-05-07T20:32:01.9372290Z self, 2025-05-07T20:32:01.9372367Z T: int, 2025-05-07T20:32:01.9372437Z D: int, 2025-05-07T20:32:01.9372534Z scale_ub: Optional[float], 2025-05-07T20:32:01.9372622Z contiguous: bool, 2025-05-07T20:32:01.9372705Z compiled: bool, 2025-05-07T20:32:01.9372778Z ) -> None: 2025-05-07T20:32:01.9372871Z torch.manual_seed(2025) 2025-05-07T20:32:01.9372944Z 2025-05-07T20:32:01.9373118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9373191Z 2025-05-07T20:32:01.9373279Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9373412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9373544Z x = x_sign * x_clamp 2025-05-07T20:32:01.9373623Z x0 = x[:, :D] 2025-05-07T20:32:01.9373707Z x1 = x[:, D:] 2025-05-07T20:32:01.9373774Z 2025-05-07T20:32:01.9373855Z if contiguous: 2025-05-07T20:32:01.9373946Z x0 = x0.contiguous() 2025-05-07T20:32:01.9374033Z x1 = x1.contiguous() 2025-05-07T20:32:01.9374106Z 2025-05-07T20:32:01.9374198Z if scale_ub is not None: 2025-05-07T20:32:01.9374300Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9374437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9374506Z ) 2025-05-07T20:32:01.9374579Z else: 2025-05-07T20:32:01.9374675Z scale_ub_tensor = None 2025-05-07T20:32:01.9374743Z 2025-05-07T20:32:01.9374869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9374965Z op = silu_mul_quant 2025-05-07T20:32:01.9375054Z if compiled: 2025-05-07T20:32:01.9375149Z op = torch.compile(op) 2025-05-07T20:32:01.9375254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9375323Z 2025-05-07T20:32:01.9375411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9375415Z 2025-05-07T20:32:01.9375518Z moe/activation_test.py:117: 2025-05-07T20:32:01.9375641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9375743Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9375839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9376341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9376441Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9376800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9377107Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9377460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9377553Z kernel = self.compile( 2025-05-07T20:32:01.9377940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9378116Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9378242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9378246Z 2025-05-07T20:32:01.9378459Z self = 2025-05-07T20:32:01.9379235Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9379774Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b9da0>} 2025-05-07T20:32:01.9380532Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9380722Z context = 2025-05-07T20:32:01.9380726Z 2025-05-07T20:32:01.9380890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9381149Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9381252Z module_map=module_map) 2025-05-07T20:32:01.9381415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9381517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9381638Z E ^ 2025-05-07T20:32:01.9381990Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9381994Z 2025-05-07T20:32:01.9382402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9382407Z 2025-05-07T20:32:01.9382513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9382733Z self=, 2025-05-07T20:32:01.9382805Z T=16384, 2025-05-07T20:32:01.9382884Z D=7168, 2025-05-07T20:32:01.9382962Z scale_ub=1200.0, 2025-05-07T20:32:01.9383048Z contiguous=False, 2025-05-07T20:32:01.9383126Z compiled=True, 2025-05-07T20:32:01.9383194Z ) 2025-05-07T20:32:01.9383413Z self = 2025-05-07T20:32:01.9383591Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.9383601Z 2025-05-07T20:32:01.9383680Z @given( 2025-05-07T20:32:01.9383825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9383931Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9384056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9384177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9384287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9384362Z ) 2025-05-07T20:32:01.9384604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9384694Z def test_silu_mul_quant( 2025-05-07T20:32:01.9384770Z self, 2025-05-07T20:32:01.9384844Z T: int, 2025-05-07T20:32:01.9384920Z D: int, 2025-05-07T20:32:01.9385016Z scale_ub: Optional[float], 2025-05-07T20:32:01.9385100Z contiguous: bool, 2025-05-07T20:32:01.9385185Z compiled: bool, 2025-05-07T20:32:01.9385346Z ) -> None: 2025-05-07T20:32:01.9385444Z torch.manual_seed(2025) 2025-05-07T20:32:01.9385517Z 2025-05-07T20:32:01.9385690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9385761Z 2025-05-07T20:32:01.9385852Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9385972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9386056Z x = x_sign * x_clamp 2025-05-07T20:32:01.9386137Z x0 = x[:, :D] 2025-05-07T20:32:01.9386212Z x1 = x[:, D:] 2025-05-07T20:32:01.9386283Z 2025-05-07T20:32:01.9386368Z if contiguous: 2025-05-07T20:32:01.9386455Z x0 = x0.contiguous() 2025-05-07T20:32:01.9386540Z x1 = x1.contiguous() 2025-05-07T20:32:01.9386615Z 2025-05-07T20:32:01.9386703Z if scale_ub is not None: 2025-05-07T20:32:01.9386803Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9386943Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9387059Z ) 2025-05-07T20:32:01.9387132Z else: 2025-05-07T20:32:01.9387228Z scale_ub_tensor = None 2025-05-07T20:32:01.9387298Z 2025-05-07T20:32:01.9387430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9387517Z op = silu_mul_quant 2025-05-07T20:32:01.9387598Z if compiled: 2025-05-07T20:32:01.9387698Z op = torch.compile(op) 2025-05-07T20:32:01.9387801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9387869Z 2025-05-07T20:32:01.9387961Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9387965Z 2025-05-07T20:32:01.9388058Z moe/activation_test.py:117: 2025-05-07T20:32:01.9388182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9388279Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9388374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9388753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9388895Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9389391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9389490Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9389847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9390067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9390413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9390503Z kernel = self.compile( 2025-05-07T20:32:01.9390887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9391059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9391188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9391193Z 2025-05-07T20:32:01.9391405Z self = 2025-05-07T20:32:01.9392169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9392672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1ca40>} 2025-05-07T20:32:01.9393417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9393684Z context = 2025-05-07T20:32:01.9393692Z 2025-05-07T20:32:01.9393854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9394116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9394222Z module_map=module_map) 2025-05-07T20:32:01.9394378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9394470Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9394548Z E ^ 2025-05-07T20:32:01.9394900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9394905Z 2025-05-07T20:32:01.9395318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9395323Z 2025-05-07T20:32:01.9395427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9395691Z self=, 2025-05-07T20:32:01.9395766Z T=1, 2025-05-07T20:32:01.9395839Z D=7168, 2025-05-07T20:32:01.9395919Z scale_ub=None, 2025-05-07T20:32:01.9396007Z contiguous=False, 2025-05-07T20:32:01.9396087Z compiled=False, 2025-05-07T20:32:01.9396158Z ) 2025-05-07T20:32:01.9396373Z self = 2025-05-07T20:32:01.9396534Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9396538Z 2025-05-07T20:32:01.9396618Z @given( 2025-05-07T20:32:01.9396733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9396827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9396942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9397053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9397168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9397294Z ) 2025-05-07T20:32:01.9397535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9397629Z def test_silu_mul_quant( 2025-05-07T20:32:01.9397702Z self, 2025-05-07T20:32:01.9397775Z T: int, 2025-05-07T20:32:01.9397853Z D: int, 2025-05-07T20:32:01.9397948Z scale_ub: Optional[float], 2025-05-07T20:32:01.9398035Z contiguous: bool, 2025-05-07T20:32:01.9398117Z compiled: bool, 2025-05-07T20:32:01.9398191Z ) -> None: 2025-05-07T20:32:01.9398282Z torch.manual_seed(2025) 2025-05-07T20:32:01.9398359Z 2025-05-07T20:32:01.9398525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9398598Z 2025-05-07T20:32:01.9398691Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9398811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9398900Z x = x_sign * x_clamp 2025-05-07T20:32:01.9398983Z x0 = x[:, :D] 2025-05-07T20:32:01.9399063Z x1 = x[:, D:] 2025-05-07T20:32:01.9399138Z 2025-05-07T20:32:01.9399216Z if contiguous: 2025-05-07T20:32:01.9399303Z x0 = x0.contiguous() 2025-05-07T20:32:01.9399394Z x1 = x1.contiguous() 2025-05-07T20:32:01.9399463Z 2025-05-07T20:32:01.9399548Z if scale_ub is not None: 2025-05-07T20:32:01.9399654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9399785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9399857Z ) 2025-05-07T20:32:01.9399929Z else: 2025-05-07T20:32:01.9400017Z scale_ub_tensor = None 2025-05-07T20:32:01.9400087Z 2025-05-07T20:32:01.9400217Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9400305Z op = silu_mul_quant 2025-05-07T20:32:01.9400392Z if compiled: 2025-05-07T20:32:01.9400489Z op = torch.compile(op) 2025-05-07T20:32:01.9400698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9400775Z 2025-05-07T20:32:01.9400862Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9400866Z 2025-05-07T20:32:01.9400959Z moe/activation_test.py:117: 2025-05-07T20:32:01.9401087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9401185Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9401280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9401778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9401872Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9402231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9402451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9402798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9402937Z kernel = self.compile( 2025-05-07T20:32:01.9403411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9403586Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9403713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9403718Z 2025-05-07T20:32:01.9403947Z self = 2025-05-07T20:32:01.9404744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9405249Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1d8a0>} 2025-05-07T20:32:01.9406046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9406235Z context = 2025-05-07T20:32:01.9406239Z 2025-05-07T20:32:01.9406401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9406665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9406768Z module_map=module_map) 2025-05-07T20:32:01.9406932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9407028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9407100Z E ^ 2025-05-07T20:32:01.9407459Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9407469Z 2025-05-07T20:32:01.9407881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9407885Z 2025-05-07T20:32:01.9407987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9408206Z self=, 2025-05-07T20:32:01.9408279Z T=2048, 2025-05-07T20:32:01.9408354Z D=7168, 2025-05-07T20:32:01.9408435Z scale_ub=None, 2025-05-07T20:32:01.9408516Z contiguous=False, 2025-05-07T20:32:01.9408599Z compiled=True, 2025-05-07T20:32:01.9408670Z ) 2025-05-07T20:32:01.9408883Z self = 2025-05-07T20:32:01.9409057Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.9409062Z 2025-05-07T20:32:01.9409136Z @given( 2025-05-07T20:32:01.9409339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9409437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9409547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9409660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9409768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9409837Z ) 2025-05-07T20:32:01.9410083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9410172Z def test_silu_mul_quant( 2025-05-07T20:32:01.9410245Z self, 2025-05-07T20:32:01.9410322Z T: int, 2025-05-07T20:32:01.9410393Z D: int, 2025-05-07T20:32:01.9410486Z scale_ub: Optional[float], 2025-05-07T20:32:01.9410572Z contiguous: bool, 2025-05-07T20:32:01.9410657Z compiled: bool, 2025-05-07T20:32:01.9410736Z ) -> None: 2025-05-07T20:32:01.9410827Z torch.manual_seed(2025) 2025-05-07T20:32:01.9410897Z 2025-05-07T20:32:01.9411113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9411187Z 2025-05-07T20:32:01.9411278Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9411403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9411487Z x = x_sign * x_clamp 2025-05-07T20:32:01.9411564Z x0 = x[:, :D] 2025-05-07T20:32:01.9411644Z x1 = x[:, D:] 2025-05-07T20:32:01.9411712Z 2025-05-07T20:32:01.9411792Z if contiguous: 2025-05-07T20:32:01.9411882Z x0 = x0.contiguous() 2025-05-07T20:32:01.9411967Z x1 = x1.contiguous() 2025-05-07T20:32:01.9412036Z 2025-05-07T20:32:01.9412122Z if scale_ub is not None: 2025-05-07T20:32:01.9412224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9412357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9412435Z ) 2025-05-07T20:32:01.9412506Z else: 2025-05-07T20:32:01.9412605Z scale_ub_tensor = None 2025-05-07T20:32:01.9412728Z 2025-05-07T20:32:01.9412855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9412946Z op = silu_mul_quant 2025-05-07T20:32:01.9413028Z if compiled: 2025-05-07T20:32:01.9413121Z op = torch.compile(op) 2025-05-07T20:32:01.9413226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9413298Z 2025-05-07T20:32:01.9413384Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9413389Z 2025-05-07T20:32:01.9413482Z moe/activation_test.py:117: 2025-05-07T20:32:01.9413606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9413700Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9413800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9414168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9414266Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9414773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9414866Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9415228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9415448Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9415789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9415882Z kernel = self.compile( 2025-05-07T20:32:01.9416262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9416437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9416642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9416651Z 2025-05-07T20:32:01.9416856Z self = 2025-05-07T20:32:01.9417628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9418129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1eb60>} 2025-05-07T20:32:01.9418881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9419069Z context = 2025-05-07T20:32:01.9419073Z 2025-05-07T20:32:01.9419285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9419548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9419650Z module_map=module_map) 2025-05-07T20:32:01.9419812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9419907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9419981Z E ^ 2025-05-07T20:32:01.9420334Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9420339Z 2025-05-07T20:32:01.9420748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9420753Z 2025-05-07T20:32:01.9420857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9421077Z self=, 2025-05-07T20:32:01.9421156Z T=4096, 2025-05-07T20:32:01.9421280Z D=7168, 2025-05-07T20:32:01.9421359Z scale_ub=None, 2025-05-07T20:32:01.9421441Z contiguous=False, 2025-05-07T20:32:01.9421521Z compiled=True, 2025-05-07T20:32:01.9421590Z ) 2025-05-07T20:32:01.9421802Z self = 2025-05-07T20:32:01.9421974Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.9421979Z 2025-05-07T20:32:01.9422053Z @given( 2025-05-07T20:32:01.9422170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9422268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9422377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9422493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9422602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9422673Z ) 2025-05-07T20:32:01.9422927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9423023Z def test_silu_mul_quant( 2025-05-07T20:32:01.9423100Z self, 2025-05-07T20:32:01.9423172Z T: int, 2025-05-07T20:32:01.9423246Z D: int, 2025-05-07T20:32:01.9423340Z scale_ub: Optional[float], 2025-05-07T20:32:01.9423425Z contiguous: bool, 2025-05-07T20:32:01.9423506Z compiled: bool, 2025-05-07T20:32:01.9423580Z ) -> None: 2025-05-07T20:32:01.9423669Z torch.manual_seed(2025) 2025-05-07T20:32:01.9423738Z 2025-05-07T20:32:01.9423907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9423976Z 2025-05-07T20:32:01.9424065Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9424191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9424276Z x = x_sign * x_clamp 2025-05-07T20:32:01.9424351Z x0 = x[:, :D] 2025-05-07T20:32:01.9424430Z x1 = x[:, D:] 2025-05-07T20:32:01.9424499Z 2025-05-07T20:32:01.9424663Z if contiguous: 2025-05-07T20:32:01.9424756Z x0 = x0.contiguous() 2025-05-07T20:32:01.9424841Z x1 = x1.contiguous() 2025-05-07T20:32:01.9424913Z 2025-05-07T20:32:01.9424998Z if scale_ub is not None: 2025-05-07T20:32:01.9425100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9425232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9425301Z ) 2025-05-07T20:32:01.9425373Z else: 2025-05-07T20:32:01.9425465Z scale_ub_tensor = None 2025-05-07T20:32:01.9425535Z 2025-05-07T20:32:01.9425660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9425748Z op = silu_mul_quant 2025-05-07T20:32:01.9425830Z if compiled: 2025-05-07T20:32:01.9425925Z op = torch.compile(op) 2025-05-07T20:32:01.9426027Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9426093Z 2025-05-07T20:32:01.9426189Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9426237Z 2025-05-07T20:32:01.9426330Z moe/activation_test.py:117: 2025-05-07T20:32:01.9426453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9426552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9426646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9427012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9427102Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9427595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9427689Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9428044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9428268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9428676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9428766Z kernel = self.compile( 2025-05-07T20:32:01.9429150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9429323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9429445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9429450Z 2025-05-07T20:32:01.9429657Z self = 2025-05-07T20:32:01.9430422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9430928Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1fe20>} 2025-05-07T20:32:01.9431684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9431871Z context = 2025-05-07T20:32:01.9431876Z 2025-05-07T20:32:01.9432038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9432297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9432403Z module_map=module_map) 2025-05-07T20:32:01.9432564Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9432655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9432731Z E ^ 2025-05-07T20:32:01.9433164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9433171Z 2025-05-07T20:32:01.9433582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9433590Z 2025-05-07T20:32:01.9433686Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9433910Z self=, 2025-05-07T20:32:01.9433987Z T=16384, 2025-05-07T20:32:01.9434058Z D=5120, 2025-05-07T20:32:01.9434136Z scale_ub=1200.0, 2025-05-07T20:32:01.9434224Z contiguous=False, 2025-05-07T20:32:01.9434303Z compiled=False, 2025-05-07T20:32:01.9434373Z ) 2025-05-07T20:32:01.9434588Z self = 2025-05-07T20:32:01.9434764Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9434809Z 2025-05-07T20:32:01.9434897Z @given( 2025-05-07T20:32:01.9435015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9435108Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9435219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9435333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9435441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9435513Z ) 2025-05-07T20:32:01.9435756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9435845Z def test_silu_mul_quant( 2025-05-07T20:32:01.9435922Z self, 2025-05-07T20:32:01.9435995Z T: int, 2025-05-07T20:32:01.9436069Z D: int, 2025-05-07T20:32:01.9436168Z scale_ub: Optional[float], 2025-05-07T20:32:01.9436254Z contiguous: bool, 2025-05-07T20:32:01.9436338Z compiled: bool, 2025-05-07T20:32:01.9436411Z ) -> None: 2025-05-07T20:32:01.9436515Z torch.manual_seed(2025) 2025-05-07T20:32:01.9436629Z 2025-05-07T20:32:01.9436793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9436862Z 2025-05-07T20:32:01.9436952Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9437072Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9437154Z x = x_sign * x_clamp 2025-05-07T20:32:01.9437238Z x0 = x[:, :D] 2025-05-07T20:32:01.9437319Z x1 = x[:, D:] 2025-05-07T20:32:01.9437387Z 2025-05-07T20:32:01.9437468Z if contiguous: 2025-05-07T20:32:01.9437560Z x0 = x0.contiguous() 2025-05-07T20:32:01.9437647Z x1 = x1.contiguous() 2025-05-07T20:32:01.9437715Z 2025-05-07T20:32:01.9437801Z if scale_ub is not None: 2025-05-07T20:32:01.9437907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9438038Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9438112Z ) 2025-05-07T20:32:01.9438197Z else: 2025-05-07T20:32:01.9438288Z scale_ub_tensor = None 2025-05-07T20:32:01.9438356Z 2025-05-07T20:32:01.9438712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9438833Z op = silu_mul_quant 2025-05-07T20:32:01.9438915Z if compiled: 2025-05-07T20:32:01.9439015Z op = torch.compile(op) 2025-05-07T20:32:01.9439118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9439188Z 2025-05-07T20:32:01.9439274Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9439278Z 2025-05-07T20:32:01.9439372Z moe/activation_test.py:117: 2025-05-07T20:32:01.9439501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9439597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9439691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9440334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.9440435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9440795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9441014Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9441354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9441443Z kernel = self.compile( 2025-05-07T20:32:01.9441823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9441994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9442119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9442123Z 2025-05-07T20:32:01.9442334Z self = 2025-05-07T20:32:01.9443162Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9443727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb38d60>} 2025-05-07T20:32:01.9444482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9444693Z context = 2025-05-07T20:32:01.9444698Z 2025-05-07T20:32:01.9444883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9445152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9445323Z module_map=module_map) 2025-05-07T20:32:01.9445482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9445580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9445654Z E ^ 2025-05-07T20:32:01.9446011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9446016Z 2025-05-07T20:32:01.9446428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9446432Z 2025-05-07T20:32:01.9446533Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9446758Z self=, 2025-05-07T20:32:01.9446834Z T=16384, 2025-05-07T20:32:01.9446907Z D=5120, 2025-05-07T20:32:01.9446986Z scale_ub=1200.0, 2025-05-07T20:32:01.9447071Z contiguous=True, 2025-05-07T20:32:01.9447159Z compiled=True, 2025-05-07T20:32:01.9447228Z ) 2025-05-07T20:32:01.9447440Z self = 2025-05-07T20:32:01.9447612Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9447616Z 2025-05-07T20:32:01.9447695Z @given( 2025-05-07T20:32:01.9447810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9447907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9448016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9448131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9448241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9448307Z ) 2025-05-07T20:32:01.9448550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9448638Z def test_silu_mul_quant( 2025-05-07T20:32:01.9448793Z self, 2025-05-07T20:32:01.9448874Z T: int, 2025-05-07T20:32:01.9448946Z D: int, 2025-05-07T20:32:01.9449039Z scale_ub: Optional[float], 2025-05-07T20:32:01.9449132Z contiguous: bool, 2025-05-07T20:32:01.9449213Z compiled: bool, 2025-05-07T20:32:01.9449288Z ) -> None: 2025-05-07T20:32:01.9449385Z torch.manual_seed(2025) 2025-05-07T20:32:01.9449455Z 2025-05-07T20:32:01.9449621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9449690Z 2025-05-07T20:32:01.9449780Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9449905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9449989Z x = x_sign * x_clamp 2025-05-07T20:32:01.9450066Z x0 = x[:, :D] 2025-05-07T20:32:01.9450144Z x1 = x[:, D:] 2025-05-07T20:32:01.9450212Z 2025-05-07T20:32:01.9450292Z if contiguous: 2025-05-07T20:32:01.9450382Z x0 = x0.contiguous() 2025-05-07T20:32:01.9450515Z x1 = x1.contiguous() 2025-05-07T20:32:01.9450587Z 2025-05-07T20:32:01.9450677Z if scale_ub is not None: 2025-05-07T20:32:01.9450777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9450909Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9450983Z ) 2025-05-07T20:32:01.9451056Z else: 2025-05-07T20:32:01.9451149Z scale_ub_tensor = None 2025-05-07T20:32:01.9451217Z 2025-05-07T20:32:01.9451341Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9451430Z op = silu_mul_quant 2025-05-07T20:32:01.9451511Z if compiled: 2025-05-07T20:32:01.9451605Z op = torch.compile(op) 2025-05-07T20:32:01.9451710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9451778Z 2025-05-07T20:32:01.9451867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9451871Z 2025-05-07T20:32:01.9451967Z moe/activation_test.py:117: 2025-05-07T20:32:01.9452098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9452241Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9452334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9452699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9452790Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9453279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9453371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9453727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9453945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9454287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9454382Z kernel = self.compile( 2025-05-07T20:32:01.9454761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9454935Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9455057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9455061Z 2025-05-07T20:32:01.9455267Z self = 2025-05-07T20:32:01.9456038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9456612Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb3a200>} 2025-05-07T20:32:01.9457369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9457555Z context = 2025-05-07T20:32:01.9457559Z 2025-05-07T20:32:01.9457724Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9457983Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9458085Z module_map=module_map) 2025-05-07T20:32:01.9458244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9458337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9458411Z E ^ 2025-05-07T20:32:01.9458771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9458842Z 2025-05-07T20:32:01.9459258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9459806Z 2025-05-07T20:32:01.9459904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9460314Z self=, 2025-05-07T20:32:01.9460703Z T=16384, 2025-05-07T20:32:01.9460889Z D=5120, 2025-05-07T20:32:01.9461070Z scale_ub=None, 2025-05-07T20:32:01.9461294Z contiguous=False, 2025-05-07T20:32:01.9461583Z compiled=True, 2025-05-07T20:32:01.9461815Z ) 2025-05-07T20:32:01.9462204Z self = 2025-05-07T20:32:01.9462721Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.9463058Z 2025-05-07T20:32:01.9463135Z @given( 2025-05-07T20:32:01.9463409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9463828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9464168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9464535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9464931Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9465246Z ) 2025-05-07T20:32:01.9465643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9466161Z def test_silu_mul_quant( 2025-05-07T20:32:01.9466416Z self, 2025-05-07T20:32:01.9466616Z T: int, 2025-05-07T20:32:01.9466818Z D: int, 2025-05-07T20:32:01.9467038Z scale_ub: Optional[float], 2025-05-07T20:32:01.9467331Z contiguous: bool, 2025-05-07T20:32:01.9467584Z compiled: bool, 2025-05-07T20:32:01.9467815Z ) -> None: 2025-05-07T20:32:01.9468033Z torch.manual_seed(2025) 2025-05-07T20:32:01.9468288Z 2025-05-07T20:32:01.9468584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9468967Z 2025-05-07T20:32:01.9469164Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9474894Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9475223Z x = x_sign * x_clamp 2025-05-07T20:32:01.9475462Z x0 = x[:, :D] 2025-05-07T20:32:01.9475674Z x1 = x[:, D:] 2025-05-07T20:32:01.9475876Z 2025-05-07T20:32:01.9476064Z if contiguous: 2025-05-07T20:32:01.9476292Z x0 = x0.contiguous() 2025-05-07T20:32:01.9476549Z x1 = x1.contiguous() 2025-05-07T20:32:01.9476787Z 2025-05-07T20:32:01.9476971Z if scale_ub is not None: 2025-05-07T20:32:01.9477236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9477568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9477880Z ) 2025-05-07T20:32:01.9478069Z else: 2025-05-07T20:32:01.9478277Z scale_ub_tensor = None 2025-05-07T20:32:01.9478531Z 2025-05-07T20:32:01.9478879Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9479199Z op = silu_mul_quant 2025-05-07T20:32:01.9479444Z if compiled: 2025-05-07T20:32:01.9479691Z op = torch.compile(op) 2025-05-07T20:32:01.9479983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9480249Z 2025-05-07T20:32:01.9480431Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9480597Z 2025-05-07T20:32:01.9480699Z moe/activation_test.py:117: 2025-05-07T20:32:01.9480990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9481317Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9481589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9482289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9482850Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9483897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:01.9484643Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:01.9485183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:01.9485862Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:01.9486524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:01.9487058Z     kernel = self.compile(
2025-05-07T20:32:01.9487600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:01.9488247Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:01.9488641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:01.9488870Z 
2025-05-07T20:32:01.9489082Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:01.9490201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:01.9491579Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd1bfb3ad40>}
2025-05-07T20:32:01.9492917Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:01.9493939Z context = <...>
2025-05-07T20:32:01.9494222Z 
2025-05-07T20:32:01.9494391Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:01.9494911Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:01.9495367Z                            module_map=module_map)
2025-05-07T20:32:01.9495730Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.9496075Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:01.9496324Z E       ^
2025-05-07T20:32:01.9496789Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.9497238Z 
2025-05-07T20:32:01.9497655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9498166Z 
2025-05-07T20:32:01.9498271Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:01.9529175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9529795Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9561038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9561654Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:01.9592677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9593294Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9631584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9632242Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:01.9663793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9664411Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:01.9678158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9678263Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:01.9690804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9690958Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9703837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9703960Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9716960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9717066Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:01.9730094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
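Editor's note: every CompilationError above is the same architecture mismatch, not a flaky failure. Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this runner's GPU and reports that only 'fp8e4b15' and 'fp8e5' are available, which is the behavior of pre-SM-8.9 devices in recent Triton releases. A minimal sketch of a capability guard that would skip these cases instead of failing in the compiler follows; the helper name, the test class name, and the (8, 9) threshold are assumptions for illustration, not FBGEMM's actual gating mechanism.

    # Sketch only: skip fp8e4nv tests on GPUs that cannot compile them.
    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # fp8e4nv kernels are assumed here to require compute capability
        # >= (8, 9); older parts expose only fp8e4b15/fp8e5, matching the
        # ValueError in the log above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    class ActivationTests(unittest.TestCase):  # hypothetical class name
        @unittest.skipUnless(
            _supports_fp8e4nv(), "requires an fp8e4nv-capable GPU (SM >= 8.9)"
        )
        def test_silu_mul_quant(self) -> None:
            ...  # the existing @given-decorated body would go here unchanged

With such a guard, the dozen examples above would be reported as skips with an explicit reason rather than as CompilationError failures.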
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9729679Z 2025-05-07T20:32:01.9730094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9730098Z 2025-05-07T20:32:01.9730199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9730421Z self=, 2025-05-07T20:32:01.9730498Z T=16384, 2025-05-07T20:32:01.9730619Z D=5120, 2025-05-07T20:32:01.9730702Z scale_ub=None, 2025-05-07T20:32:01.9730831Z contiguous=False, 2025-05-07T20:32:01.9730913Z compiled=False, 2025-05-07T20:32:01.9730990Z ) 2025-05-07T20:32:01.9731204Z self = 2025-05-07T20:32:01.9731384Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9731388Z 2025-05-07T20:32:01.9731465Z @given( 2025-05-07T20:32:01.9731582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9731683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9731793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9731907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9732022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9732093Z ) 2025-05-07T20:32:01.9732340Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9732432Z def test_silu_mul_quant( 2025-05-07T20:32:01.9732515Z self, 2025-05-07T20:32:01.9732633Z T: int, 2025-05-07T20:32:01.9732710Z D: int, 2025-05-07T20:32:01.9732809Z scale_ub: Optional[float], 2025-05-07T20:32:01.9732902Z contiguous: bool, 2025-05-07T20:32:01.9732987Z compiled: bool, 2025-05-07T20:32:01.9733066Z ) -> None: 2025-05-07T20:32:01.9733165Z torch.manual_seed(2025) 2025-05-07T20:32:01.9733242Z 2025-05-07T20:32:01.9733409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9733483Z 2025-05-07T20:32:01.9733573Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9733694Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9735522Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
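This CompilationError is independent of the Hypothesis parameters: fp8e4nv is Triton's float8 e4m3 type, and the supported list the error prints, ('fp8e4b15', 'fp8e5'), appears to be what Triton reports on GPUs whose compute capability is below 8.9, so the kernel can never compile on this runner's GPU. A minimal guard sketch, assuming the sm_89 cutoff; the helper name and skip wiring are illustrative, not FBGEMM's actual test scaffolding:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumption: Triton only exposes fp8e4nv (float8 e4m3) on GPUs with
    # compute capability >= 8.9 (Ada/Hopper); older parts report only
    # fp8e4b15/fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: a skip on the failing test class.
@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class SiluMulQuantTests(unittest.TestCase):
    ...

With a guard like this, the examples below would be skipped on this runner instead of failing one by one.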
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
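The OOM side of the log is cumulative rather than per-example: this first OOM cannot get 320 MiB while PyTorch already holds 21.60 GiB, and the free figure keeps shrinking in the examples that follow, which suggests tensors from earlier failed examples are still referenced (Hypothesis keeps tracebacks of failures alive, and torch.compile keeps its own caches). A best-effort cleanup sketch under that assumption; where to call it, for instance at the top of the test body so each example starts from a drained allocator, is a suggestion, not FBGEMM's actual fixture:

import gc

import torch


def release_cuda_memory() -> None:
    # Drop unreferenced tensors first, then return the caching allocator's
    # unused blocks to the driver; synchronize so pending frees have landed.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Note that empty_cache() only releases blocks no live tensor occupies, so it helps with fragmentation and inter-process pressure but cannot reclaim memory that is still referenced.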
All remaining examples failed in one of the same two ways: torch.OutOfMemoryError while building the bfloat16 inputs (moe/activation_test.py:92 torch.randn, :94 torch.sign, or :95 torch.clamp), or the fp8e4nv CompilationError above from triton/compiler/compiler.py:100 once execution reached the kernel. Free GPU memory shrank from 140.44 MiB to 26.44 MiB over the run. In order:

Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
Trying example: T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False -> OutOfMemoryError at :92 (448.00 MiB)
Trying example: T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True -> OutOfMemoryError at :95 (56.00 MiB)
Trying example: T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :94 (56.00 MiB)
Trying example: T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=128, D=5120, scale_ub=None, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=128, D=7168, scale_ub=None, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> OutOfMemoryError at :92 (56.00 MiB)
Trying example: T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :94 (40.00 MiB)
Trying example: T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (320.00 MiB)
Trying example: T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (80.00 MiB)
Trying example: T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False -> OutOfMemoryError at :92 (40.00 MiB)
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True -> OutOfMemoryError at :92 (112.00 MiB)
Trying example: T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> OutOfMemoryError at :92 (40.00 MiB)
Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> OutOfMemoryError at :92 (112.00 MiB)
Trying example: T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True -> OutOfMemoryError at :92 (448.00 MiB)
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (112.00 MiB)
Trying example: T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (448.00 MiB)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9869183Z 2025-05-07T20:32:01.9869301Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9869305Z 2025-05-07T20:32:01.9869406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9869626Z self=, 2025-05-07T20:32:01.9869706Z T=16384, 2025-05-07T20:32:01.9869782Z D=7168, 2025-05-07T20:32:01.9869865Z scale_ub=1200.0, 2025-05-07T20:32:01.9869948Z contiguous=True, 2025-05-07T20:32:01.9870030Z compiled=False, 2025-05-07T20:32:01.9870103Z ) 2025-05-07T20:32:01.9870315Z self = 2025-05-07T20:32:01.9870489Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.9870541Z 2025-05-07T20:32:01.9870618Z @given( 2025-05-07T20:32:01.9870730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9870824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9870937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9871048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9871160Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9871233Z ) 2025-05-07T20:32:01.9871472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9871565Z def test_silu_mul_quant( 2025-05-07T20:32:01.9871640Z self, 2025-05-07T20:32:01.9871715Z T: int, 2025-05-07T20:32:01.9871794Z D: int, 2025-05-07T20:32:01.9871888Z scale_ub: Optional[float], 2025-05-07T20:32:01.9871974Z contiguous: bool, 2025-05-07T20:32:01.9872101Z compiled: bool, 2025-05-07T20:32:01.9872176Z ) -> None: 2025-05-07T20:32:01.9872321Z torch.manual_seed(2025) 2025-05-07T20:32:01.9872390Z 2025-05-07T20:32:01.9872555Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9874334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
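The allocator hint repeated in these OOM messages can be acted on directly. A minimal sketch, assuming the variable has to be visible before torch initializes its CUDA caching allocator (exporting it in the CI job environment before pytest launches achieves the same thing):

    # Sketch: apply the allocator advice from the OOM messages above.
    # PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
    # initialized, so it must be set before the first CUDA allocation.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported afterwards so the setting is already in place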
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9874340Z 2025-05-07T20:32:01.9874453Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9874458Z 2025-05-07T20:32:01.9874563Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9874788Z self=, 2025-05-07T20:32:01.9874904Z T=128, 2025-05-07T20:32:01.9874980Z D=5120, 2025-05-07T20:32:01.9875058Z scale_ub=1200.0, 2025-05-07T20:32:01.9875140Z contiguous=False, 2025-05-07T20:32:01.9875224Z compiled=False, 2025-05-07T20:32:01.9875293Z ) 2025-05-07T20:32:01.9875505Z self = 2025-05-07T20:32:01.9875677Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9875682Z 2025-05-07T20:32:01.9875758Z @given( 2025-05-07T20:32:01.9875875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9875968Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9876078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9876192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9876302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9876380Z ) 2025-05-07T20:32:01.9876623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9876715Z def test_silu_mul_quant( 2025-05-07T20:32:01.9876789Z self, 2025-05-07T20:32:01.9876869Z T: int, 2025-05-07T20:32:01.9876945Z D: int, 2025-05-07T20:32:01.9877042Z scale_ub: Optional[float], 2025-05-07T20:32:01.9877130Z contiguous: bool, 2025-05-07T20:32:01.9877213Z compiled: bool, 2025-05-07T20:32:01.9877291Z ) -> None: 2025-05-07T20:32:01.9877382Z torch.manual_seed(2025) 2025-05-07T20:32:01.9877455Z 2025-05-07T20:32:01.9877624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9877693Z 2025-05-07T20:32:01.9877782Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9877907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9877994Z x = x_sign * x_clamp 2025-05-07T20:32:01.9878117Z x0 = x[:, :D] 2025-05-07T20:32:01.9878204Z x1 = x[:, D:] 2025-05-07T20:32:01.9878273Z 2025-05-07T20:32:01.9878354Z if contiguous: 2025-05-07T20:32:01.9878446Z x0 = x0.contiguous() 2025-05-07T20:32:01.9878532Z x1 = x1.contiguous() 2025-05-07T20:32:01.9878608Z 2025-05-07T20:32:01.9878696Z if scale_ub is not None: 2025-05-07T20:32:01.9878799Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9878935Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9879008Z ) 2025-05-07T20:32:01.9879082Z else: 2025-05-07T20:32:01.9879180Z scale_ub_tensor = None 2025-05-07T20:32:01.9879250Z 2025-05-07T20:32:01.9879378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9884188Z op = silu_mul_quant 2025-05-07T20:32:01.9884301Z if compiled: 2025-05-07T20:32:01.9884470Z op = torch.compile(op) 2025-05-07T20:32:01.9884582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9884700Z 2025-05-07T20:32:01.9884793Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9884798Z 2025-05-07T20:32:01.9884896Z moe/activation_test.py:117: 2025-05-07T20:32:01.9885029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9885130Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9885235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9885736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9885834Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9886197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9886419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9886763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9886913Z kernel = self.compile( 2025-05-07T20:32:01.9887298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9887475Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9887601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9887606Z 2025-05-07T20:32:01.9887811Z self = 2025-05-07T20:32:01.9888589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9889095Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf52b7e0>} 2025-05-07T20:32:01.9889853Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9890044Z context = 2025-05-07T20:32:01.9890049Z 2025-05-07T20:32:01.9890216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9890490Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9890597Z module_map=module_map) 2025-05-07T20:32:01.9890759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9890858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9890935Z E ^ 2025-05-07T20:32:01.9891340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9891351Z 2025-05-07T20:32:01.9891764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9891769Z 2025-05-07T20:32:01.9891872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9892090Z self=, 2025-05-07T20:32:01.9892170Z T=2048, 2025-05-07T20:32:01.9892254Z D=7168, 2025-05-07T20:32:01.9892336Z scale_ub=None, 2025-05-07T20:32:01.9892423Z contiguous=False, 2025-05-07T20:32:01.9892514Z compiled=False, 2025-05-07T20:32:01.9892586Z ) 2025-05-07T20:32:01.9892801Z self = 2025-05-07T20:32:01.9892979Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9892984Z 2025-05-07T20:32:01.9893101Z @given( 2025-05-07T20:32:01.9893223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9893361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9893474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9893596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9893710Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9893793Z ) 2025-05-07T20:32:01.9894080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9894175Z def test_silu_mul_quant( 2025-05-07T20:32:01.9894248Z self, 2025-05-07T20:32:01.9894327Z T: int, 2025-05-07T20:32:01.9894405Z D: int, 2025-05-07T20:32:01.9894506Z scale_ub: Optional[float], 2025-05-07T20:32:01.9894596Z contiguous: bool, 2025-05-07T20:32:01.9894681Z compiled: bool, 2025-05-07T20:32:01.9894766Z ) -> None: 2025-05-07T20:32:01.9894863Z torch.manual_seed(2025) 2025-05-07T20:32:01.9894937Z 2025-05-07T20:32:01.9895113Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9896937Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
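The CompilationError above ("type fp8e4nv not supported in this architecture") is Triton rejecting the e4m3 fp8 type on this runner's GPU. A minimal sketch of a capability guard, under the assumption that fp8e4nv needs compute capability 8.9 or newer (the g5 runner's A10G reports 8.6, which matches the error); the helper and decorator names are illustrative, not from the test file:

    # Sketch: skip fp8 kernels on GPUs that predate fp8e4nv (e4m3) support.
    # Assumption: Triton's fp8e4nv needs SM 8.9+ (Ada/Hopper); an A10G is SM 8.6.
    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    skip_unless_fp8 = unittest.skipUnless(_supports_fp8e4nv(), "GPU lacks fp8e4nv support")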
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9896943Z 2025-05-07T20:32:01.9897065Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9897069Z 2025-05-07T20:32:01.9897172Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9897391Z self=, 2025-05-07T20:32:01.9897478Z T=128, 2025-05-07T20:32:01.9897561Z D=7168, 2025-05-07T20:32:01.9897644Z scale_ub=1200.0, 2025-05-07T20:32:01.9897728Z contiguous=True, 2025-05-07T20:32:01.9897808Z compiled=True, 2025-05-07T20:32:01.9897885Z ) 2025-05-07T20:32:01.9898098Z self = 2025-05-07T20:32:01.9898262Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9898267Z 2025-05-07T20:32:01.9898344Z @given( 2025-05-07T20:32:01.9898460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9898556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9898670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9898785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9898901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9898975Z ) 2025-05-07T20:32:01.9899260Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9899362Z def test_silu_mul_quant( 2025-05-07T20:32:01.9899441Z self, 2025-05-07T20:32:01.9899515Z T: int, 2025-05-07T20:32:01.9899593Z D: int, 2025-05-07T20:32:01.9899685Z scale_ub: Optional[float], 2025-05-07T20:32:01.9899770Z contiguous: bool, 2025-05-07T20:32:01.9899859Z compiled: bool, 2025-05-07T20:32:01.9899936Z ) -> None: 2025-05-07T20:32:01.9900028Z torch.manual_seed(2025) 2025-05-07T20:32:01.9900103Z 2025-05-07T20:32:01.9900268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9900347Z 2025-05-07T20:32:01.9900435Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9900557Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9900648Z x = x_sign * x_clamp 2025-05-07T20:32:01.9900729Z x0 = x[:, :D] 2025-05-07T20:32:01.9900875Z x1 = x[:, D:] 2025-05-07T20:32:01.9900954Z 2025-05-07T20:32:01.9901076Z if contiguous: 2025-05-07T20:32:01.9901170Z x0 = x0.contiguous() 2025-05-07T20:32:01.9901268Z x1 = x1.contiguous() 2025-05-07T20:32:01.9901338Z 2025-05-07T20:32:01.9901428Z if scale_ub is not None: 2025-05-07T20:32:01.9901537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9901668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9901742Z ) 2025-05-07T20:32:01.9901821Z else: 2025-05-07T20:32:01.9901914Z scale_ub_tensor = None 2025-05-07T20:32:01.9901991Z 2025-05-07T20:32:01.9902118Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9902206Z op = silu_mul_quant 2025-05-07T20:32:01.9902292Z if compiled: 2025-05-07T20:32:01.9902390Z op = torch.compile(op) 2025-05-07T20:32:01.9902494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9902574Z 2025-05-07T20:32:01.9902663Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9902714Z 2025-05-07T20:32:01.9902814Z moe/activation_test.py:117: 2025-05-07T20:32:01.9902946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9903043Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9903144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9903512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9903606Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9904104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9904197Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9904553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9904786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9905133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9905231Z kernel = self.compile( 2025-05-07T20:32:01.9905613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9905785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9905912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9905917Z 2025-05-07T20:32:01.9906121Z self = 2025-05-07T20:32:01.9906895Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9907435Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf8d6a20>} 2025-05-07T20:32:01.9908190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9908380Z context = 2025-05-07T20:32:01.9908385Z 2025-05-07T20:32:01.9908545Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9908808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9908911Z module_map=module_map) 2025-05-07T20:32:01.9909068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9909170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9909291Z E ^ 2025-05-07T20:32:01.9909684Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9909697Z 2025-05-07T20:32:01.9910111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9910116Z 2025-05-07T20:32:01.9910216Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9910440Z self=, 2025-05-07T20:32:01.9910516Z T=128, 2025-05-07T20:32:01.9910592Z D=7168, 2025-05-07T20:32:01.9910680Z scale_ub=1200.0, 2025-05-07T20:32:01.9910764Z contiguous=True, 2025-05-07T20:32:01.9910848Z compiled=False, 2025-05-07T20:32:01.9910926Z ) 2025-05-07T20:32:01.9911141Z self = 2025-05-07T20:32:01.9911311Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.9911315Z 2025-05-07T20:32:01.9911403Z @given( 2025-05-07T20:32:01.9911562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9911656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9911773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9911885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9911998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9912071Z ) 2025-05-07T20:32:01.9912309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9912404Z def test_silu_mul_quant( 2025-05-07T20:32:01.9912477Z self, 2025-05-07T20:32:01.9912553Z T: int, 2025-05-07T20:32:01.9912631Z D: int, 2025-05-07T20:32:01.9912726Z scale_ub: Optional[float], 2025-05-07T20:32:01.9912813Z contiguous: bool, 2025-05-07T20:32:01.9912898Z compiled: bool, 2025-05-07T20:32:01.9912979Z ) -> None: 2025-05-07T20:32:01.9913070Z torch.manual_seed(2025) 2025-05-07T20:32:01.9913147Z 2025-05-07T20:32:01.9913312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9913387Z 2025-05-07T20:32:01.9913475Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9913597Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9915425Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9915432Z 2025-05-07T20:32:01.9915548Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.9915598Z 2025-05-07T20:32:01.9915702Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9915919Z self=, 2025-05-07T20:32:01.9915992Z T=128, 2025-05-07T20:32:01.9916075Z D=5120, 2025-05-07T20:32:01.9916154Z scale_ub=1200.0, 2025-05-07T20:32:01.9916235Z contiguous=True, 2025-05-07T20:32:01.9916317Z compiled=True, 2025-05-07T20:32:01.9916387Z ) 2025-05-07T20:32:01.9916599Z self = 2025-05-07T20:32:01.9916767Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9916771Z 2025-05-07T20:32:01.9916845Z @given( 2025-05-07T20:32:01.9916962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9917056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9917207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9917327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9917477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9917549Z ) 2025-05-07T20:32:01.9917796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9917885Z def test_silu_mul_quant( 2025-05-07T20:32:01.9917962Z self, 2025-05-07T20:32:01.9918033Z T: int, 2025-05-07T20:32:01.9918107Z D: int, 2025-05-07T20:32:01.9918204Z scale_ub: Optional[float], 2025-05-07T20:32:01.9918289Z contiguous: bool, 2025-05-07T20:32:01.9918371Z compiled: bool, 2025-05-07T20:32:01.9918450Z ) -> None: 2025-05-07T20:32:01.9918541Z torch.manual_seed(2025) 2025-05-07T20:32:01.9918612Z 2025-05-07T20:32:01.9918781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9918851Z 2025-05-07T20:32:01.9918939Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9919067Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9920877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
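The OOM failures above show memory accumulating across Hypothesis examples: the largest case allocates a 16384 x 14336 bfloat16 input (exactly the 448.00 MiB requests seen earlier), and by these later examples only a few MiB of the 22 GiB card remain free. A minimal sketch of reclaiming memory between examples; the helper name and where to hook it (a per-example setup or the top of the test body) are assumptions, not from the test file:

    # Sketch: reclaim CUDA memory so earlier examples' inputs
    # don't starve later, larger ones.
    import gc
    import torch

    def _release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references first
        torch.cuda.synchronize()  # let pending kernels finish
        torch.cuda.empty_cache()  # return cached blocks to the driver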
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9920882Z 2025-05-07T20:32:01.9921001Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.9921006Z 2025-05-07T20:32:01.9921106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9921329Z self=, 2025-05-07T20:32:01.9921405Z T=128, 2025-05-07T20:32:01.9921479Z D=7168, 2025-05-07T20:32:01.9921568Z scale_ub=None, 2025-05-07T20:32:01.9921650Z contiguous=True, 2025-05-07T20:32:01.9921729Z compiled=True, 2025-05-07T20:32:01.9921804Z ) 2025-05-07T20:32:01.9922014Z self = 2025-05-07T20:32:01.9922174Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.9922181Z 2025-05-07T20:32:01.9922253Z @given( 2025-05-07T20:32:01.9922365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9922464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9922574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9922688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9922802Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9922873Z ) 2025-05-07T20:32:01.9923111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9923356Z def test_silu_mul_quant( 2025-05-07T20:32:01.9923436Z self, 2025-05-07T20:32:01.9923511Z T: int, 2025-05-07T20:32:01.9923587Z D: int, 2025-05-07T20:32:01.9923683Z scale_ub: Optional[float], 2025-05-07T20:32:01.9923792Z contiguous: bool, 2025-05-07T20:32:01.9923881Z compiled: bool, 2025-05-07T20:32:01.9923977Z ) -> None: 2025-05-07T20:32:01.9924078Z torch.manual_seed(2025) 2025-05-07T20:32:01.9924148Z 2025-05-07T20:32:01.9924311Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9926131Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9926178Z 2025-05-07T20:32:01.9926294Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9926427Z =============================== warnings summary =============================== 2025-05-07T20:32:01.9926734Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.9927031Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.9927329Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.9928201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:01.9928479Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:01.9928484Z 2025-05-07T20:32:01.9928697Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:01.9928866Z ================= 1 failed, 1 deselected, 3 warnings in 16.63s ================= 2025-05-07T20:32:03.7209799Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:03.7891672Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:03.7891916Z 2025-05-07T20:32:05.7911788Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:07.9408013Z ============================= test session starts ============================== 2025-05-07T20:32:07.9408693Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:07.9409216Z cachedir: .pytest_cache 2025-05-07T20:32:07.9409792Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:07.9410516Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:07.9410926Z plugins: hypothesis-6.131.14 2025-05-07T20:32:09.5490024Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:09.6999759Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:09.7000339Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:09.7000620Z 2025-05-07T20:32:12.1285891Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1286967Z self=, 2025-05-07T20:32:12.1287442Z T=1, 2025-05-07T20:32:12.1287810Z D=5120, 2025-05-07T20:32:12.1288195Z scale_ub=None, 2025-05-07T20:32:12.1288627Z contiguous=True, 2025-05-07T20:32:12.1289063Z compiled=True, 2025-05-07T20:32:12.1289472Z ) 2025-05-07T20:32:12.1290113Z self = 2025-05-07T20:32:12.1291077Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.1291615Z 2025-05-07T20:32:12.1291772Z @given( 2025-05-07T20:32:12.1292233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1292856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1293462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1294114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1294912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1295602Z ) 2025-05-07T20:32:12.1296303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1297185Z def test_silu_mul_quant( 2025-05-07T20:32:12.1297573Z self, 2025-05-07T20:32:12.1297802Z T: int, 2025-05-07T20:32:12.1298023Z D: int, 2025-05-07T20:32:12.1298240Z scale_ub: Optional[float], 2025-05-07T20:32:12.1298514Z contiguous: bool, 2025-05-07T20:32:12.1298755Z compiled: bool, 2025-05-07T20:32:12.1298976Z ) -> None: 2025-05-07T20:32:12.1299199Z torch.manual_seed(2025) 2025-05-07T20:32:12.1299443Z 2025-05-07T20:32:12.1299714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1300061Z 2025-05-07T20:32:12.1300262Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1300550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:12.1300868Z x = x_sign * x_clamp 2025-05-07T20:32:12.1301112Z x0 = x[:, :D] 2025-05-07T20:32:12.1301438Z x1 = x[:, D:] 2025-05-07T20:32:12.1301646Z 2025-05-07T20:32:12.1301840Z if contiguous: 2025-05-07T20:32:12.1302081Z x0 = x0.contiguous() 2025-05-07T20:32:12.1302336Z x1 = x1.contiguous() 2025-05-07T20:32:12.1302582Z 2025-05-07T20:32:12.1302782Z if scale_ub is not None: 2025-05-07T20:32:12.1303055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1303399Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1303723Z ) 2025-05-07T20:32:12.1303916Z else: 2025-05-07T20:32:12.1304135Z scale_ub_tensor = None 2025-05-07T20:32:12.1304395Z 2025-05-07T20:32:12.1304633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1304954Z op = silu_mul_quant 2025-05-07T20:32:12.1305210Z if compiled: 2025-05-07T20:32:12.1305461Z op = torch.compile(op) 2025-05-07T20:32:12.1305768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1306055Z 2025-05-07T20:32:12.1306254Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.1306536Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.1306832Z 2025-05-07T20:32:12.1307076Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1307408Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.1307705Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.1308025Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.1308382Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.1308702Z 2025-05-07T20:32:12.1308907Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.1309105Z 2025-05-07T20:32:12.1309206Z moe/activation_test.py:126: 2025-05-07T20:32:12.1309510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1309848Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.1310237Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.1311032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.1311798Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.1312348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1313032Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1313731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.1314464Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.1315280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.1316077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.1316820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.1317471Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.1318077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.1318595Z fn() 2025-05-07T20:32:12.1319110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.1319697Z self.fn.run( 
2025-05-07T20:32:12.1320162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1320699Z kernel = self.compile( 2025-05-07T20:32:12.1321254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1321967Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1322356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1322596Z 2025-05-07T20:32:12.1322803Z self = 2025-05-07T20:32:12.1324042Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1325436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cda813a0>} 2025-05-07T20:32:12.1326778Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1327873Z context = 2025-05-07T20:32:12.1328170Z 2025-05-07T20:32:12.1328338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1328868Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1329334Z module_map=module_map) 2025-05-07T20:32:12.1329704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1330070Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.1330345Z E ^ 2025-05-07T20:32:12.1330808Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1331269Z 2025-05-07T20:32:12.1331693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.1332262Z 2025-05-07T20:32:12.1332387Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1332868Z self=, 2025-05-07T20:32:12.1333334Z T=2048, 2025-05-07T20:32:12.1333540Z D=5120, 2025-05-07T20:32:12.1333752Z scale_ub=1200.0, 2025-05-07T20:32:12.1333990Z contiguous=True, 2025-05-07T20:32:12.1334232Z compiled=False, 2025-05-07T20:32:12.1334458Z ) 2025-05-07T20:32:13.0821479Z self = 2025-05-07T20:32:13.0822613Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.0823172Z 2025-05-07T20:32:13.0823354Z @given( 2025-05-07T20:32:13.0823814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0824442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0825463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0826130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0826955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0827524Z ) 2025-05-07T20:32:13.0828042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0828484Z def test_silu_mul_quant( 2025-05-07T20:32:13.0828733Z self, 2025-05-07T20:32:13.0828938Z T: int, 2025-05-07T20:32:13.0829139Z D: int, 2025-05-07T20:32:13.0829361Z scale_ub: Optional[float], 2025-05-07T20:32:13.0829638Z contiguous: bool, 2025-05-07T20:32:13.0829880Z compiled: bool, 2025-05-07T20:32:13.0830123Z ) -> None: 2025-05-07T20:32:13.0830347Z torch.manual_seed(2025) 2025-05-07T20:32:13.0830587Z 2025-05-07T20:32:13.0830873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0831228Z 
2025-05-07T20:32:13.0831426Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0831727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0832166Z x = x_sign * x_clamp 2025-05-07T20:32:13.0832414Z x0 = x[:, :D] 2025-05-07T20:32:13.0832627Z x1 = x[:, D:] 2025-05-07T20:32:13.0832842Z 2025-05-07T20:32:13.0833036Z if contiguous: 2025-05-07T20:32:13.0833271Z x0 = x0.contiguous() 2025-05-07T20:32:13.0833538Z x1 = x1.contiguous() 2025-05-07T20:32:13.0833785Z 2025-05-07T20:32:13.0833975Z if scale_ub is not None: 2025-05-07T20:32:13.0834256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0834596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0834900Z ) 2025-05-07T20:32:13.0835096Z else: 2025-05-07T20:32:13.0835313Z scale_ub_tensor = None 2025-05-07T20:32:13.0835560Z 2025-05-07T20:32:13.0835799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0836120Z op = silu_mul_quant 2025-05-07T20:32:13.0836368Z if compiled: 2025-05-07T20:32:13.0836627Z op = torch.compile(op) 2025-05-07T20:32:13.0836927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0837201Z 2025-05-07T20:32:13.0837400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0837574Z 2025-05-07T20:32:13.0837682Z moe/activation_test.py:117: 2025-05-07T20:32:13.0837988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0838324Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0838775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0839476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0840168Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.0840712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0841506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0842192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0842729Z kernel = self.compile( 2025-05-07T20:32:13.0843380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0844049Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0844445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0844687Z 2025-05-07T20:32:13.0844898Z self = 2025-05-07T20:32:13.0846049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.0847488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cd7382c0>} 2025-05-07T20:32:13.0848842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0849866Z context = 2025-05-07T20:32:13.0850164Z 2025-05-07T20:32:13.0850335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0850874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0851351Z module_map=module_map) 2025-05-07T20:32:13.0851718Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0852083Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.0852354Z E ^ 2025-05-07T20:32:13.0852915Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0853376Z 2025-05-07T20:32:13.0861476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0862142Z 2025-05-07T20:32:13.0862258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0862679Z self=, 2025-05-07T20:32:13.0863085Z T=2048, 2025-05-07T20:32:13.0863280Z D=5120, 2025-05-07T20:32:13.0863481Z scale_ub=1200.0, 2025-05-07T20:32:13.0863703Z contiguous=True, 2025-05-07T20:32:13.0863932Z compiled=True, 2025-05-07T20:32:13.0864148Z ) 2025-05-07T20:32:13.0864467Z self = 2025-05-07T20:32:13.0864976Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:13.0865268Z 2025-05-07T20:32:13.0865347Z @given( 2025-05-07T20:32:13.0865584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0865898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0866210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0866546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0866869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0867165Z ) 2025-05-07T20:32:13.0867518Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0867955Z def test_silu_mul_quant( 2025-05-07T20:32:13.0868203Z self, 2025-05-07T20:32:13.0868401Z T: int, 2025-05-07T20:32:13.0868590Z D: int, 2025-05-07T20:32:13.0868812Z scale_ub: Optional[float], 2025-05-07T20:32:13.0869088Z contiguous: bool, 2025-05-07T20:32:13.0869335Z compiled: bool, 2025-05-07T20:32:13.0869551Z ) -> None: 2025-05-07T20:32:13.0869855Z torch.manual_seed(2025) 2025-05-07T20:32:13.0870104Z 2025-05-07T20:32:13.0870377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0870723Z 2025-05-07T20:32:13.0870923Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0871212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0871531Z x = x_sign * x_clamp 2025-05-07T20:32:13.0871772Z x0 = x[:, :D] 2025-05-07T20:32:13.0871981Z x1 = x[:, D:] 2025-05-07T20:32:13.0872192Z 2025-05-07T20:32:13.0872380Z if contiguous: 2025-05-07T20:32:13.0872608Z x0 = x0.contiguous() 2025-05-07T20:32:13.0872872Z x1 = x1.contiguous() 2025-05-07T20:32:13.0873121Z 2025-05-07T20:32:13.0873310Z if scale_ub is not None: 2025-05-07T20:32:13.0873586Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0873969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0874329Z ) 2025-05-07T20:32:13.0874524Z else: 2025-05-07T20:32:13.0874736Z scale_ub_tensor = None 2025-05-07T20:32:13.0874991Z 2025-05-07T20:32:13.0875218Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0875537Z op = silu_mul_quant 2025-05-07T20:32:13.0875793Z if compiled: 
2025-05-07T20:32:13.0876039Z op = torch.compile(op) 2025-05-07T20:32:13.0876345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0876629Z 2025-05-07T20:32:13.0876817Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.0877108Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.0877409Z 2025-05-07T20:32:13.0877641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0877987Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.0878285Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.0878596Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.0879014Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.0879332Z 2025-05-07T20:32:13.0879537Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:13.0879731Z 2025-05-07T20:32:13.0879831Z moe/activation_test.py:126: 2025-05-07T20:32:13.0880134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0880474Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.0880799Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.0881599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.0882362Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.0882918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0883723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0884426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.0885163Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.0885930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:13.0886680Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.0887416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.0888057Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.0888708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.0889232Z fn() 2025-05-07T20:32:13.0889795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.0890377Z self.fn.run( 2025-05-07T20:32:13.0890851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0891383Z kernel = self.compile( 2025-05-07T20:32:13.0891922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0892579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0892976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0893202Z 2025-05-07T20:32:13.0893415Z self = 2025-05-07T20:32:13.0894529Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:13.0895935Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cd739440>} 2025-05-07T20:32:13.0897269Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0898292Z context = 2025-05-07T20:32:13.0898577Z 2025-05-07T20:32:13.0898750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0899268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0899738Z module_map=module_map) 2025-05-07T20:32:13.0900108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0900504Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.0900773Z E ^ 2025-05-07T20:32:13.0901241Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0901691Z 2025-05-07T20:32:13.0902121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0902631Z 2025-05-07T20:32:13.0902735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0903147Z self=, 2025-05-07T20:32:13.0903550Z T=16384, 2025-05-07T20:32:13.0903742Z D=7168, 2025-05-07T20:32:13.0903938Z scale_ub=1200.0, 2025-05-07T20:32:13.0904162Z contiguous=False, 2025-05-07T20:32:13.0904379Z compiled=False, 2025-05-07T20:32:13.0904586Z ) 2025-05-07T20:32:13.8995606Z self = 2025-05-07T20:32:13.8996179Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.8996468Z 2025-05-07T20:32:13.8996557Z @given( 2025-05-07T20:32:13.8996790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8997107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8997418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8997752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8998119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8998406Z ) 2025-05-07T20:32:13.8998759Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8999193Z def test_silu_mul_quant( 2025-05-07T20:32:13.8999445Z self, 2025-05-07T20:32:13.8999643Z T: int, 2025-05-07T20:32:13.8999839Z D: int, 2025-05-07T20:32:13.9000068Z scale_ub: Optional[float], 2025-05-07T20:32:13.9000539Z contiguous: bool, 2025-05-07T20:32:13.9000775Z compiled: bool, 2025-05-07T20:32:13.9000996Z ) -> None: 2025-05-07T20:32:13.9001213Z torch.manual_seed(2025) 2025-05-07T20:32:13.9001450Z 2025-05-07T20:32:13.9001727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9002073Z 2025-05-07T20:32:13.9002269Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9002558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9002866Z x = x_sign * x_clamp 2025-05-07T20:32:13.9003108Z x0 = x[:, :D] 2025-05-07T20:32:13.9003443Z x1 = x[:, D:] 2025-05-07T20:32:13.9003657Z 2025-05-07T20:32:13.9003839Z if contiguous: 2025-05-07T20:32:13.9004064Z x0 = x0.contiguous() 2025-05-07T20:32:13.9004315Z x1 = x1.contiguous() 2025-05-07T20:32:13.9004547Z 2025-05-07T20:32:13.9004802Z if scale_ub is not None: 2025-05-07T20:32:13.9005073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9005467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9005771Z ) 2025-05-07T20:32:13.9005964Z else: 2025-05-07T20:32:13.9006176Z scale_ub_tensor = None 2025-05-07T20:32:13.9006427Z 2025-05-07T20:32:13.9006664Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:13.9006980Z op = silu_mul_quant 2025-05-07T20:32:13.9007224Z if compiled: 2025-05-07T20:32:13.9007476Z op = torch.compile(op) 2025-05-07T20:32:13.9007775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9008051Z 2025-05-07T20:32:13.9008240Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.9008412Z 2025-05-07T20:32:13.9008515Z moe/activation_test.py:117: 2025-05-07T20:32:13.9008811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9009140Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.9009426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9010191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.9010885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.9011418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9012107Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9012774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9013306Z kernel = self.compile( 2025-05-07T20:32:13.9013856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9014521Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9014926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9015159Z 2025-05-07T20:32:13.9015367Z self = 2025-05-07T20:32:13.9016444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9017806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cc82e660>} 2025-05-07T20:32:13.9019140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9020161Z context = 2025-05-07T20:32:13.9020460Z 2025-05-07T20:32:13.9020673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9021204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9021676Z module_map=module_map) 2025-05-07T20:32:13.9022044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9022403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.9022671Z E ^ 2025-05-07T20:32:13.9023138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9023596Z 2025-05-07T20:32:13.9024016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9024535Z 2025-05-07T20:32:13.9024641Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.9025101Z self=, 2025-05-07T20:32:13.9025588Z T=1, 2025-05-07T20:32:13.9025769Z D=7168, 2025-05-07T20:32:13.9025971Z scale_ub=None, 2025-05-07T20:32:13.9026189Z contiguous=True, 2025-05-07T20:32:13.9026412Z compiled=True, 2025-05-07T20:32:13.9026624Z ) 2025-05-07T20:32:13.9026945Z self = 2025-05-07T20:32:13.9027425Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.9027696Z 2025-05-07T20:32:13.9027773Z @given( 2025-05-07T20:32:13.9028009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.9028319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.9028625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.9028959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.9029289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.9029576Z ) 2025-05-07T20:32:13.9029932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.9030429Z def test_silu_mul_quant( 2025-05-07T20:32:13.9030666Z self, 2025-05-07T20:32:13.9030864Z T: int, 2025-05-07T20:32:13.9031064Z D: int, 2025-05-07T20:32:13.9031274Z scale_ub: Optional[float], 2025-05-07T20:32:13.9031548Z contiguous: bool, 2025-05-07T20:32:13.9031788Z compiled: bool, 2025-05-07T20:32:13.9032007Z ) -> None: 2025-05-07T20:32:13.9032227Z torch.manual_seed(2025) 2025-05-07T20:32:13.9032468Z 2025-05-07T20:32:13.9032735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9033079Z 2025-05-07T20:32:13.9033274Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9033561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9033872Z x = x_sign * x_clamp 2025-05-07T20:32:13.9034115Z x0 = x[:, :D] 2025-05-07T20:32:13.9034332Z x1 = x[:, D:] 2025-05-07T20:32:13.9034550Z 2025-05-07T20:32:13.9034749Z if contiguous: 2025-05-07T20:32:13.9034989Z x0 = x0.contiguous() 2025-05-07T20:32:13.9035242Z x1 = x1.contiguous() 2025-05-07T20:32:13.9035487Z 2025-05-07T20:32:13.9035678Z if scale_ub is not None: 2025-05-07T20:32:13.9035948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9036289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9036604Z ) 2025-05-07T20:32:13.9036793Z else: 2025-05-07T20:32:13.9037008Z scale_ub_tensor = None 2025-05-07T20:32:13.9037261Z 2025-05-07T20:32:13.9037494Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9037813Z op = silu_mul_quant 2025-05-07T20:32:13.9038067Z if compiled: 2025-05-07T20:32:13.9038333Z op = torch.compile(op) 2025-05-07T20:32:13.9038810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9039090Z 2025-05-07T20:32:13.9039371Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.9039657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.9039952Z 2025-05-07T20:32:13.9040195Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9040526Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.9040820Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.9041135Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.9041499Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.9041807Z 2025-05-07T20:32:13.9042018Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:13.9042212Z 2025-05-07T20:32:13.9042321Z moe/activation_test.py:126: 2025-05-07T20:32:13.9042612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9042948Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.9043473Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.9044311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.9045070Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.9045623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9046311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9046997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.9047724Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.9048483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:13.9049242Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.9050045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.9050690Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.9051292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.9051810Z fn() 2025-05-07T20:32:13.9052322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.9052908Z self.fn.run( 2025-05-07T20:32:13.9053382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9054083Z kernel = self.compile( 2025-05-07T20:32:13.9054635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9055298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9055695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9055930Z 2025-05-07T20:32:13.9056139Z self = 2025-05-07T20:32:13.9057560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9059144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cc5ea5c0>} 2025-05-07T20:32:13.9060485Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9061568Z context = 2025-05-07T20:32:13.9061865Z 2025-05-07T20:32:13.9062032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9062557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9063029Z module_map=module_map) 2025-05-07T20:32:13.9063392Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9063751Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.9064025Z E ^ 2025-05-07T20:32:13.9064488Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9064944Z 2025-05-07T20:32:13.9065365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9065931Z 2025-05-07T20:32:13.9066039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.9066508Z self=, 2025-05-07T20:32:13.9066909Z T=4096, 2025-05-07T20:32:13.9067102Z D=5120, 2025-05-07T20:32:13.9067300Z scale_ub=None, 2025-05-07T20:32:13.9067511Z contiguous=False, 2025-05-07T20:32:13.9067740Z compiled=False, 2025-05-07T20:32:13.9067955Z ) 2025-05-07T20:32:14.8310405Z self = 2025-05-07T20:32:14.8310958Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.8311254Z 2025-05-07T20:32:14.8311340Z @given( 2025-05-07T20:32:14.8311583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8311898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8312212Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8312554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8312882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8313319Z ) 2025-05-07T20:32:14.8313675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8314121Z def test_silu_mul_quant( 2025-05-07T20:32:14.8314372Z self, 2025-05-07T20:32:14.8314578Z T: int, 2025-05-07T20:32:14.8314770Z D: int, 2025-05-07T20:32:14.8314997Z scale_ub: Optional[float], 2025-05-07T20:32:14.8315277Z contiguous: bool, 2025-05-07T20:32:14.8315514Z compiled: bool, 2025-05-07T20:32:14.8315748Z ) -> None: 2025-05-07T20:32:14.8315965Z torch.manual_seed(2025) 2025-05-07T20:32:14.8316213Z 2025-05-07T20:32:14.8316480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8316827Z 2025-05-07T20:32:14.8317020Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8317309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8317626Z x = x_sign * x_clamp 2025-05-07T20:32:14.8317872Z x0 = x[:, :D] 2025-05-07T20:32:14.8318104Z x1 = x[:, D:] 2025-05-07T20:32:14.8318349Z 2025-05-07T20:32:14.8318537Z if contiguous: 2025-05-07T20:32:14.8318767Z x0 = x0.contiguous() 2025-05-07T20:32:14.8319031Z x1 = x1.contiguous() 2025-05-07T20:32:14.8319277Z 2025-05-07T20:32:14.8319465Z if scale_ub is not None: 2025-05-07T20:32:14.8319740Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8320080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8320388Z ) 2025-05-07T20:32:14.8320585Z else: 2025-05-07T20:32:14.8320800Z scale_ub_tensor = None 2025-05-07T20:32:14.8321055Z 2025-05-07T20:32:14.8321285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8321603Z op = silu_mul_quant 2025-05-07T20:32:14.8321859Z if compiled: 
2025-05-07T20:32:14.8322109Z op = torch.compile(op) 2025-05-07T20:32:14.8322487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8322774Z 2025-05-07T20:32:14.8322962Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.8323132Z 2025-05-07T20:32:14.8323314Z moe/activation_test.py:117: 2025-05-07T20:32:14.8323614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8323944Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.8324234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8324930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.8325630Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.8326168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.8326931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.8327620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.8328219Z kernel = self.compile( 2025-05-07T20:32:14.8328771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.8329442Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.8329846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8330076Z 2025-05-07T20:32:14.8330287Z self = 2025-05-07T20:32:14.8331375Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.8332765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cc068540>} 2025-05-07T20:32:14.8334157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.8335197Z context = 2025-05-07T20:32:14.8335485Z 2025-05-07T20:32:14.8335657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.8336187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.8336663Z module_map=module_map) 2025-05-07T20:32:14.8337029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.8337388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.8337649Z E ^ 2025-05-07T20:32:14.8338125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.8338827Z 2025-05-07T20:32:14.8339251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.8339774Z 2025-05-07T20:32:14.8339879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.8340298Z self=, 2025-05-07T20:32:14.8340706Z T=4096, 2025-05-07T20:32:14.8340892Z D=7168, 2025-05-07T20:32:14.8341093Z scale_ub=None, 2025-05-07T20:32:14.8341307Z contiguous=False, 2025-05-07T20:32:14.8348249Z compiled=False, 2025-05-07T20:32:14.8348541Z ) 2025-05-07T20:32:14.8348871Z self = 2025-05-07T20:32:14.8349387Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.8349661Z 2025-05-07T20:32:14.8349748Z @given( 2025-05-07T20:32:14.8350110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8350440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8350740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8351079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8351409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8351717Z ) 2025-05-07T20:32:14.8352062Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8352510Z def test_silu_mul_quant( 2025-05-07T20:32:14.8352754Z self, 2025-05-07T20:32:14.8352952Z T: int, 2025-05-07T20:32:14.8353147Z D: int, 2025-05-07T20:32:14.8353370Z scale_ub: Optional[float], 2025-05-07T20:32:14.8353643Z contiguous: bool, 2025-05-07T20:32:14.8353875Z compiled: bool, 2025-05-07T20:32:14.8354103Z ) -> None: 2025-05-07T20:32:14.8354388Z torch.manual_seed(2025) 2025-05-07T20:32:14.8354629Z 2025-05-07T20:32:14.8354964Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8355314Z 2025-05-07T20:32:14.8355505Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8355799Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8356109Z x = x_sign * x_clamp 2025-05-07T20:32:14.8356344Z x0 = x[:, :D] 2025-05-07T20:32:14.8356563Z x1 = x[:, D:] 2025-05-07T20:32:14.8356777Z 2025-05-07T20:32:14.8356964Z if contiguous: 2025-05-07T20:32:14.8357202Z x0 = x0.contiguous() 2025-05-07T20:32:14.8357467Z x1 = x1.contiguous() 2025-05-07T20:32:14.8357702Z 2025-05-07T20:32:14.8357901Z if scale_ub is not None: 2025-05-07T20:32:14.8358184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8358525Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8358832Z ) 2025-05-07T20:32:14.8359035Z else: 2025-05-07T20:32:14.8359255Z scale_ub_tensor = None 2025-05-07T20:32:14.8359579Z 2025-05-07T20:32:14.8359816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8360137Z op = silu_mul_quant 2025-05-07T20:32:14.8360379Z if compiled: 2025-05-07T20:32:14.8360631Z op = torch.compile(op) 2025-05-07T20:32:14.8360934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8361204Z 2025-05-07T20:32:14.8361396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.8361559Z 2025-05-07T20:32:14.8361667Z moe/activation_test.py:117: 2025-05-07T20:32:14.8361957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8362297Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.8362586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8363394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.8364091Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.8364641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.8365331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.8366003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.8366536Z kernel = self.compile( 2025-05-07T20:32:14.8367090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.8367753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.8368151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8368396Z 2025-05-07T20:32:14.8368643Z self = 2025-05-07T20:32:14.8369782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.8371158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cc5cc720>} 2025-05-07T20:32:14.8372499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.8373514Z context = 2025-05-07T20:32:14.8373807Z 2025-05-07T20:32:14.8373975Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.8374537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.8375051Z module_map=module_map) 2025-05-07T20:32:14.8375413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.8375770Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.8376039Z E ^ 2025-05-07T20:32:14.8376501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.8376959Z 2025-05-07T20:32:14.8377381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.8377907Z 2025-05-07T20:32:14.8378012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.8378435Z self=, 2025-05-07T20:32:14.8378835Z T=128, 2025-05-07T20:32:14.8379023Z D=7168, 2025-05-07T20:32:14.8379209Z scale_ub=None, 2025-05-07T20:32:14.8379426Z contiguous=False, 2025-05-07T20:32:14.8379658Z compiled=True, 2025-05-07T20:32:14.8379905Z ) 2025-05-07T20:32:14.8806430Z self = 2025-05-07T20:32:14.8806978Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.8807271Z 2025-05-07T20:32:14.8807358Z @given( 2025-05-07T20:32:14.8807595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8807908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8808216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8808546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8808871Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8809162Z ) 2025-05-07T20:32:14.8809514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8809951Z def test_silu_mul_quant( 2025-05-07T20:32:14.8810201Z self, 2025-05-07T20:32:14.8810405Z T: int, 2025-05-07T20:32:14.8810600Z D: int, 2025-05-07T20:32:14.8810825Z scale_ub: Optional[float], 2025-05-07T20:32:14.8811100Z contiguous: bool, 2025-05-07T20:32:14.8811335Z compiled: bool, 2025-05-07T20:32:14.8811564Z ) -> None: 2025-05-07T20:32:14.8811787Z torch.manual_seed(2025) 2025-05-07T20:32:14.8812025Z 2025-05-07T20:32:14.8812296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8812641Z 2025-05-07T20:32:14.8812841Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8813132Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8813441Z x = x_sign * x_clamp 2025-05-07T20:32:14.8813687Z x0 = x[:, :D] 2025-05-07T20:32:14.8813900Z x1 = x[:, D:] 2025-05-07T20:32:14.8814110Z 2025-05-07T20:32:14.8814302Z if contiguous: 2025-05-07T20:32:14.8814527Z x0 = x0.contiguous() 2025-05-07T20:32:14.8814788Z x1 = x1.contiguous() 2025-05-07T20:32:14.8815038Z 2025-05-07T20:32:14.8815325Z if scale_ub is not None: 2025-05-07T20:32:14.8815610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8815957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8816265Z ) 2025-05-07T20:32:14.8816464Z else: 2025-05-07T20:32:14.8816679Z scale_ub_tensor = None 2025-05-07T20:32:14.8816930Z 2025-05-07T20:32:14.8817169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8817490Z op = silu_mul_quant 2025-05-07T20:32:14.8817741Z if compiled: 2025-05-07T20:32:14.8817989Z op = torch.compile(op) 2025-05-07T20:32:14.8818288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8818571Z 2025-05-07T20:32:14.8818762Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.8819048Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.8819413Z 2025-05-07T20:32:14.8819652Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8820045Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.8820339Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.8820649Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.8821012Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.8821325Z 2025-05-07T20:32:14.8821531Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:14.8821725Z 2025-05-07T20:32:14.8821826Z moe/activation_test.py:126: 2025-05-07T20:32:14.8822123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8822463Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.8822786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.8823585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.8824346Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.8824969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.8825651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.8826344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.8827071Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.8827822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.8828575Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.8829362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.8830010Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.8830613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.8831141Z fn() 2025-05-07T20:32:14.8831656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.8832240Z self.fn.run( 2025-05-07T20:32:14.8832707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.8833244Z kernel = self.compile( 2025-05-07T20:32:14.8833793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.8834447Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.8834847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8835087Z 2025-05-07T20:32:14.8835343Z self = 2025-05-07T20:32:14.8836435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.8837800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cc0bb060>} 2025-05-07T20:32:14.8839370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.8840402Z context = 2025-05-07T20:32:14.8840690Z 2025-05-07T20:32:14.8840935Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.8841543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.8842008Z module_map=module_map) 2025-05-07T20:32:14.8842380Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.8842750Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.8843015Z E ^ 2025-05-07T20:32:14.8843579Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.8844033Z 2025-05-07T20:32:14.8844460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.8844986Z 2025-05-07T20:32:14.8845100Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.8845507Z self=, 2025-05-07T20:32:14.8845915Z T=128, 2025-05-07T20:32:14.8846108Z D=7168, 2025-05-07T20:32:14.8846305Z scale_ub=None, 2025-05-07T20:32:14.8846596Z contiguous=False, 2025-05-07T20:32:14.8846828Z compiled=False, 2025-05-07T20:32:14.8847028Z ) 2025-05-07T20:32:15.2007450Z self = 2025-05-07T20:32:15.2007991Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.2008286Z 2025-05-07T20:32:15.2008375Z @given( 2025-05-07T20:32:15.2008657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2008977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2009282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2009605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2009932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2010215Z ) 2025-05-07T20:32:15.2010563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2011004Z def test_silu_mul_quant( 2025-05-07T20:32:15.2011255Z self, 2025-05-07T20:32:15.2011447Z T: int, 2025-05-07T20:32:15.2011636Z D: int, 2025-05-07T20:32:15.2011851Z scale_ub: Optional[float], 2025-05-07T20:32:15.2012120Z contiguous: bool, 2025-05-07T20:32:15.2012357Z compiled: bool, 2025-05-07T20:32:15.2012581Z ) -> None: 2025-05-07T20:32:15.2012794Z torch.manual_seed(2025) 2025-05-07T20:32:15.2013032Z 2025-05-07T20:32:15.2013303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2013646Z 2025-05-07T20:32:15.2013832Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2014119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2014428Z x = x_sign * x_clamp 2025-05-07T20:32:15.2014663Z x0 = x[:, :D] 2025-05-07T20:32:15.2014880Z x1 = x[:, D:] 2025-05-07T20:32:15.2015084Z 2025-05-07T20:32:15.2015265Z if contiguous: 2025-05-07T20:32:15.2015617Z x0 = x0.contiguous() 2025-05-07T20:32:15.2015882Z x1 = x1.contiguous() 2025-05-07T20:32:15.2016118Z 2025-05-07T20:32:15.2016308Z if scale_ub is not None: 2025-05-07T20:32:15.2016588Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2016924Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2017231Z ) 2025-05-07T20:32:15.2017426Z else: 2025-05-07T20:32:15.2017635Z scale_ub_tensor = None 2025-05-07T20:32:15.2017884Z 2025-05-07T20:32:15.2018119Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2018463Z op = silu_mul_quant 2025-05-07T20:32:15.2018731Z if compiled: 
2025-05-07T20:32:15.2018981Z op = torch.compile(op) 2025-05-07T20:32:15.2019277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2019548Z 2025-05-07T20:32:15.2019832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2019997Z 2025-05-07T20:32:15.2020164Z moe/activation_test.py:117: 2025-05-07T20:32:15.2020456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2020795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2021085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2021781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2022468Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2023009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2023701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2024369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2024900Z kernel = self.compile( 2025-05-07T20:32:15.2025456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2026197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2026585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2026816Z 2025-05-07T20:32:15.2027028Z self = 2025-05-07T20:32:15.2028104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2029513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a3a88cc0>} 2025-05-07T20:32:15.2030855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2031877Z context = 2025-05-07T20:32:15.2032168Z 2025-05-07T20:32:15.2032336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2032858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2033326Z module_map=module_map) 2025-05-07T20:32:15.2033683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2034037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2034296Z E ^ 2025-05-07T20:32:15.2034755Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.2035210Z 2025-05-07T20:32:15.2035680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2036219Z 2025-05-07T20:32:15.2036319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2036732Z self=, 2025-05-07T20:32:15.2037124Z T=4096, 2025-05-07T20:32:15.2037309Z D=5120, 2025-05-07T20:32:15.2037499Z scale_ub=1200.0, 2025-05-07T20:32:15.2037718Z contiguous=True, 2025-05-07T20:32:15.2037938Z compiled=False, 2025-05-07T20:32:15.2038141Z ) 2025-05-07T20:32:15.2038715Z self = 2025-05-07T20:32:15.2039238Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.2039517Z 2025-05-07T20:32:15.2039598Z @given( 2025-05-07T20:32:15.2039832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2040139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2040514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2040901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2041234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2041517Z ) 2025-05-07T20:32:15.2041871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2042311Z def test_silu_mul_quant( 2025-05-07T20:32:15.2042544Z self, 2025-05-07T20:32:15.2042739Z T: int, 2025-05-07T20:32:15.2042936Z D: int, 2025-05-07T20:32:15.2043146Z scale_ub: Optional[float], 2025-05-07T20:32:15.2043471Z contiguous: bool, 2025-05-07T20:32:15.2043714Z compiled: bool, 2025-05-07T20:32:15.2043935Z ) -> None: 2025-05-07T20:32:15.2044150Z torch.manual_seed(2025) 2025-05-07T20:32:15.2044396Z 2025-05-07T20:32:15.2044666Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2045005Z 2025-05-07T20:32:15.2045199Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2045487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2045868Z x = x_sign * x_clamp 2025-05-07T20:32:15.2046103Z x0 = x[:, :D] 2025-05-07T20:32:15.2046322Z x1 = x[:, D:] 2025-05-07T20:32:15.2046522Z 2025-05-07T20:32:15.2046702Z if contiguous: 2025-05-07T20:32:15.2046933Z x0 = x0.contiguous() 2025-05-07T20:32:15.2047188Z x1 = x1.contiguous() 2025-05-07T20:32:15.2047427Z 2025-05-07T20:32:15.2047615Z if scale_ub is not None: 2025-05-07T20:32:15.2047880Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2048211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2048523Z ) 2025-05-07T20:32:15.2048735Z else: 2025-05-07T20:32:15.2048968Z scale_ub_tensor = None 2025-05-07T20:32:15.2049218Z 2025-05-07T20:32:15.2049441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2049755Z op = silu_mul_quant 2025-05-07T20:32:15.2050010Z if compiled: 2025-05-07T20:32:15.2050250Z op = torch.compile(op) 2025-05-07T20:32:15.2050545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2050817Z 2025-05-07T20:32:15.2051006Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2051166Z 2025-05-07T20:32:15.2051264Z moe/activation_test.py:117: 2025-05-07T20:32:15.2051553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2051884Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2052156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2052847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2053540Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2054082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2054833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2055506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2056043Z kernel = self.compile( 2025-05-07T20:32:15.2056576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2057237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2057636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2057861Z 2025-05-07T20:32:15.2058071Z self = 2025-05-07T20:32:15.2059234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2060640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a3a89f80>} 2025-05-07T20:32:15.2061980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2063003Z context = 2025-05-07T20:32:15.2063286Z 2025-05-07T20:32:15.2063458Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2063971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2064437Z module_map=module_map) 2025-05-07T20:32:15.2064803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2065149Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2065455Z E ^ 2025-05-07T20:32:15.2065922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.2066371Z 2025-05-07T20:32:15.2066790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2067303Z 2025-05-07T20:32:15.2067406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2067816Z self=, 2025-05-07T20:32:15.2068216Z T=1, 2025-05-07T20:32:15.2068392Z D=5120, 2025-05-07T20:32:15.2068582Z scale_ub=None, 2025-05-07T20:32:15.2068794Z contiguous=True, 2025-05-07T20:32:15.2069006Z compiled=True, 2025-05-07T20:32:15.2069205Z ) 2025-05-07T20:32:15.6548453Z self = 2025-05-07T20:32:15.6549040Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.6549316Z 2025-05-07T20:32:15.6549399Z @given( 2025-05-07T20:32:15.6549641Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6549961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6550277Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6550609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6550944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6551237Z ) 2025-05-07T20:32:15.6551588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6552035Z def test_silu_mul_quant( 2025-05-07T20:32:15.6552283Z self, 2025-05-07T20:32:15.6552475Z T: int, 2025-05-07T20:32:15.6552677Z D: int, 2025-05-07T20:32:15.6552905Z scale_ub: Optional[float], 2025-05-07T20:32:15.6553177Z contiguous: bool, 2025-05-07T20:32:15.6553427Z compiled: bool, 2025-05-07T20:32:15.6553809Z ) -> None: 2025-05-07T20:32:15.6554029Z torch.manual_seed(2025) 2025-05-07T20:32:15.6554273Z 2025-05-07T20:32:15.6554550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6554899Z 2025-05-07T20:32:15.6561593Z x_sign = torch.sign(x) 2025-05-07T20:32:15.6561914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.6562231Z x = x_sign * x_clamp 2025-05-07T20:32:15.6562484Z x0 = x[:, :D] 2025-05-07T20:32:15.6562710Z x1 = x[:, D:] 2025-05-07T20:32:15.6562918Z 2025-05-07T20:32:15.6563117Z if contiguous: 2025-05-07T20:32:15.6563478Z x0 = x0.contiguous() 2025-05-07T20:32:15.6563737Z x1 = x1.contiguous() 2025-05-07T20:32:15.6563985Z 2025-05-07T20:32:15.6564184Z if scale_ub is not None: 2025-05-07T20:32:15.6564455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.6564914Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.6565299Z ) 2025-05-07T20:32:15.6565495Z else: 2025-05-07T20:32:15.6565704Z scale_ub_tensor = None 2025-05-07T20:32:15.6565965Z 2025-05-07T20:32:15.6566207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.6566520Z op = silu_mul_quant 2025-05-07T20:32:15.6566782Z if compiled: 2025-05-07T20:32:15.6567035Z op = torch.compile(op) 2025-05-07T20:32:15.6567327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.6567610Z 2025-05-07T20:32:15.6567807Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.6568090Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.6568385Z 2025-05-07T20:32:15.6568641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.6569021Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.6569326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.6569645Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.6570087Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.6570398Z 2025-05-07T20:32:15.6570605Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:15.6570800Z 2025-05-07T20:32:15.6570915Z moe/activation_test.py:126: 2025-05-07T20:32:15.6571208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6571549Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.6571879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.6572670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.6573434Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.6573988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.6574683Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.6575380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.6576107Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.6576867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:15.6577623Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.6578375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.6579045Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.6579657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.6580180Z fn() 2025-05-07T20:32:15.6580738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.6581338Z self.fn.run( 2025-05-07T20:32:15.6581810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.6582337Z kernel = self.compile( 2025-05-07T20:32:15.6582886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.6583548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.6583951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6584179Z 2025-05-07T20:32:15.6584387Z self = 2025-05-07T20:32:15.6585519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.6586938Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18a3a8afc0>} 2025-05-07T20:32:15.6588285Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.6589357Z context = 2025-05-07T20:32:15.6589654Z 2025-05-07T20:32:15.6589822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.6590355Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.6590832Z module_map=module_map) 2025-05-07T20:32:15.6591199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.6591605Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.6591879Z E ^ 2025-05-07T20:32:15.6592342Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.6592800Z 2025-05-07T20:32:15.6593223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.6593746Z 2025-05-07T20:32:15.6593850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6594270Z self=, 2025-05-07T20:32:15.6594669Z T=2048, 2025-05-07T20:32:15.6594873Z D=5120, 2025-05-07T20:32:15.6595071Z scale_ub=None, 2025-05-07T20:32:15.6595281Z contiguous=True, 2025-05-07T20:32:15.6595507Z compiled=True, 2025-05-07T20:32:15.6595705Z ) 2025-05-07T20:32:16.0931123Z self = 2025-05-07T20:32:16.0931662Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.0931933Z 2025-05-07T20:32:16.0932021Z @given( 2025-05-07T20:32:16.0932262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0932602Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0932916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0933253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0933579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0933877Z ) 2025-05-07T20:32:16.0934237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0934679Z def test_silu_mul_quant( 2025-05-07T20:32:16.0934928Z self, 2025-05-07T20:32:16.0935128Z T: int, 2025-05-07T20:32:16.0935324Z D: int, 2025-05-07T20:32:16.0935549Z scale_ub: Optional[float], 2025-05-07T20:32:16.0935945Z contiguous: bool, 2025-05-07T20:32:16.0936193Z compiled: bool, 2025-05-07T20:32:16.0936418Z ) -> None: 2025-05-07T20:32:16.0936638Z torch.manual_seed(2025) 2025-05-07T20:32:16.0936883Z 2025-05-07T20:32:16.0937154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0937503Z 2025-05-07T20:32:16.0937704Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0937993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0938305Z x = x_sign * x_clamp 2025-05-07T20:32:16.0938760Z x0 = x[:, :D] 2025-05-07T20:32:16.0938975Z x1 = x[:, D:] 2025-05-07T20:32:16.0939190Z 2025-05-07T20:32:16.0939384Z if contiguous: 2025-05-07T20:32:16.0939618Z x0 = x0.contiguous() 2025-05-07T20:32:16.0939885Z x1 = x1.contiguous() 2025-05-07T20:32:16.0940132Z 2025-05-07T20:32:16.0940394Z if scale_ub is not None: 2025-05-07T20:32:16.0940682Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.0941086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.0941403Z ) 2025-05-07T20:32:16.0941597Z else: 2025-05-07T20:32:16.0941810Z scale_ub_tensor = None 2025-05-07T20:32:16.0942071Z 2025-05-07T20:32:16.0943731Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0944051Z op = silu_mul_quant 2025-05-07T20:32:16.0944304Z if compiled: 
2025-05-07T20:32:16.0944550Z op = torch.compile(op) 2025-05-07T20:32:16.0944852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0945137Z 2025-05-07T20:32:16.0945328Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.0945621Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.0945920Z 2025-05-07T20:32:16.0946157Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0946503Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.0946804Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.0947199Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.0947559Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.0947874Z 2025-05-07T20:32:16.0948083Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.0948276Z 2025-05-07T20:32:16.0948378Z moe/activation_test.py:126: 2025-05-07T20:32:16.0948681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0949020Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.0949344Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.0950137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.0950897Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.0951454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.0952143Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.0952840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.0953570Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.0954329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:16.0955077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.0955813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.0956455Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.0957128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.0957672Z fn() 2025-05-07T20:32:16.0958188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.0958780Z self.fn.run( 2025-05-07T20:32:16.0959300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.0959839Z kernel = self.compile( 2025-05-07T20:32:16.0960389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.0961044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.0961443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0961679Z 2025-05-07T20:32:16.0961935Z self = 2025-05-07T20:32:16.0963019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:16.0964543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a37bf420>} 2025-05-07T20:32:16.0965875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.0966907Z context = 2025-05-07T20:32:16.0967198Z 2025-05-07T20:32:16.0967368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.0967898Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.0968511Z module_map=module_map) 2025-05-07T20:32:16.0968880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.0969240Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.0969509Z E ^ 2025-05-07T20:32:16.0969977Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.0970437Z 2025-05-07T20:32:16.0970859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.0971380Z 2025-05-07T20:32:16.0971493Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0971905Z self=, 2025-05-07T20:32:16.0972314Z T=128, 2025-05-07T20:32:16.0972508Z D=5120, 2025-05-07T20:32:16.0972700Z scale_ub=None, 2025-05-07T20:32:16.0972924Z contiguous=True, 2025-05-07T20:32:16.0973159Z compiled=True, 2025-05-07T20:32:16.0973367Z ) 2025-05-07T20:32:16.7668129Z self = 2025-05-07T20:32:16.7669083Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.7669370Z 2025-05-07T20:32:16.7669468Z @given( 2025-05-07T20:32:16.7669717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.7670047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.7670365Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.7670710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.7671053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.7671350Z ) 2025-05-07T20:32:16.7671703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.7672160Z def test_silu_mul_quant( 2025-05-07T20:32:16.7672426Z self, 2025-05-07T20:32:16.7672628Z T: int, 2025-05-07T20:32:16.7672968Z D: int, 2025-05-07T20:32:16.7673210Z scale_ub: Optional[float], 2025-05-07T20:32:16.7673494Z contiguous: bool, 2025-05-07T20:32:16.7673746Z compiled: bool, 2025-05-07T20:32:16.7673978Z ) -> None: 2025-05-07T20:32:16.7674199Z torch.manual_seed(2025) 2025-05-07T20:32:16.7674448Z 2025-05-07T20:32:16.7674728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.7675068Z 2025-05-07T20:32:16.7675264Z x_sign = torch.sign(x) 2025-05-07T20:32:16.7675564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.7675881Z x = x_sign * x_clamp 2025-05-07T20:32:16.7676115Z x0 = x[:, :D] 2025-05-07T20:32:16.7676336Z x1 = x[:, D:] 2025-05-07T20:32:16.7676551Z 2025-05-07T20:32:16.7676738Z if contiguous: 2025-05-07T20:32:16.7676973Z x0 = x0.contiguous() 2025-05-07T20:32:16.7677301Z x1 = x1.contiguous() 2025-05-07T20:32:16.7677592Z 2025-05-07T20:32:16.7677802Z if scale_ub is not None: 2025-05-07T20:32:16.7678078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.7678413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.7678725Z ) 2025-05-07T20:32:16.7678950Z else: 2025-05-07T20:32:16.7679185Z scale_ub_tensor = None 2025-05-07T20:32:16.7679445Z 2025-05-07T20:32:16.7679684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:16.7679992Z op = silu_mul_quant 2025-05-07T20:32:16.7680244Z if compiled: 2025-05-07T20:32:16.7680497Z op = torch.compile(op) 2025-05-07T20:32:16.7680792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7681079Z 2025-05-07T20:32:16.7681284Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.7681577Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.7681872Z 2025-05-07T20:32:16.7682119Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.7682533Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.7682829Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.7683151Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.7683595Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.7683907Z 2025-05-07T20:32:16.7684119Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.7684313Z 2025-05-07T20:32:16.7684423Z moe/activation_test.py:126: 2025-05-07T20:32:16.7684730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7685071Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.7685402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.7686204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.7686965Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.7687529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.7688226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.7688977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.7689702Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.7690462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:16.7691223Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.7691968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.7692654Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.7693275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.7693807Z fn() 2025-05-07T20:32:16.7694318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.7694911Z self.fn.run( 2025-05-07T20:32:16.7695388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.7695924Z kernel = self.compile( 2025-05-07T20:32:16.7696468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.7697130Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.7697573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7697804Z 2025-05-07T20:32:16.7698058Z self = 2025-05-07T20:32:16.7699197Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.7700566Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f18a3045e40>}
2025-05-07T20:32:16.7701911Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:16.7702946Z context = <...>
2025-05-07T20:32:16.7703232Z 
2025-05-07T20:32:16.7703406Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.7703942Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.7704455Z                            module_map=module_map)
2025-05-07T20:32:16.7704824Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.7705183Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:16.7705458Z E       ^
2025-05-07T20:32:16.7705929Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.7706386Z 
2025-05-07T20:32:16.7706806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.7707331Z 
2025-05-07T20:32:16.7707439Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.7707855Z     self=<...>,
2025-05-07T20:32:16.7708266Z     T=4096,
2025-05-07T20:32:16.7708455Z     D=5120,
2025-05-07T20:32:16.7708656Z     scale_ub=None,
2025-05-07T20:32:16.7708880Z     contiguous=True,
2025-05-07T20:32:16.7709103Z     compiled=True,
2025-05-07T20:32:16.7709317Z )
2025-05-07T20:32:17.2825802Z self = <...>
2025-05-07T20:32:17.2826858Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:17.2827406Z 
2025-05-07T20:32:17.2827559Z     @given(
2025-05-07T20:32:17.2828017Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:17.2828638Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:17.2829247Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:17.2829633Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:17.2829956Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:17.2830242Z     )
2025-05-07T20:32:17.2830598Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:17.2831149Z     def test_silu_mul_quant(
2025-05-07T20:32:17.2831411Z         self,
2025-05-07T20:32:17.2831614Z         T: int,
2025-05-07T20:32:17.2831809Z         D: int,
2025-05-07T20:32:17.2832032Z         scale_ub: Optional[float],
2025-05-07T20:32:17.2832312Z         contiguous: bool,
2025-05-07T20:32:17.2832551Z         compiled: bool,
2025-05-07T20:32:17.2832779Z     ) -> None:
2025-05-07T20:32:17.2833000Z         torch.manual_seed(2025)
2025-05-07T20:32:17.2833245Z 
2025-05-07T20:32:17.2833516Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.2833868Z 
2025-05-07T20:32:17.2834067Z         x_sign = torch.sign(x)
2025-05-07T20:32:17.2834353Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:17.2834666Z         x = x_sign * x_clamp
2025-05-07T20:32:17.2834910Z         x0 = x[:, :D]
2025-05-07T20:32:17.2835122Z         x1 = x[:, D:]
2025-05-07T20:32:17.2835338Z 
2025-05-07T20:32:17.2835594Z         if contiguous:
2025-05-07T20:32:17.2835828Z             x0 = x0.contiguous()
2025-05-07T20:32:17.2836156Z             x1 = x1.contiguous()
2025-05-07T20:32:17.2836401Z 
2025-05-07T20:32:17.2836589Z         if scale_ub is not None:
2025-05-07T20:32:17.2836861Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:17.2837201Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:17.2837508Z             )
2025-05-07T20:32:17.2837703Z         else:
2025-05-07T20:32:17.2837916Z             scale_ub_tensor = None
2025-05-07T20:32:17.2838170Z 
2025-05-07T20:32:17.2838566Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.2838885Z             op = silu_mul_quant
2025-05-07T20:32:17.2839143Z             if compiled:
2025-05-07T20:32:17.2839388Z                 op = torch.compile(op)
2025-05-07T20:32:17.2839687Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.2839966Z 
2025-05-07T20:32:17.2840160Z         y_fp8, y_scale = fn()
2025-05-07T20:32:17.2840451Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:17.2840828Z 
2025-05-07T20:32:17.2841065Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.2841403Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:17.2841705Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:17.2842019Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:17.2842383Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.2842698Z 
2025-05-07T20:32:17.2842906Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.2843103Z 
2025-05-07T20:32:17.2843282Z moe/activation_test.py:126: 
2025-05-07T20:32:17.2843581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.2843915Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.2844238Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.2845036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.2845801Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:17.2846358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.2847040Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.2847738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:17.2848474Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:17.2849233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:17.2849983Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:17.2850818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:17.2851470Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:17.2852067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:17.2852589Z     fn()
2025-05-07T20:32:17.2853100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:17.2853682Z     self.fn.run(
2025-05-07T20:32:17.2854150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.2854683Z     kernel = self.compile(
2025-05-07T20:32:17.2855228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.2855973Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.2856378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.2856672Z 
2025-05-07T20:32:17.2856882Z self = <...>
2025-05-07T20:32:17.2857957Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:17.2859329Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f18a3352840>}
2025-05-07T20:32:17.2860664Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:17.2861694Z context = <...>
2025-05-07T20:32:17.2861984Z 
2025-05-07T20:32:17.2862206Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.2862727Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.2863190Z                            module_map=module_map)
2025-05-07T20:32:17.2863559Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.2863921Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.2864184Z E       ^
2025-05-07T20:32:17.2864651Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.2865104Z 
2025-05-07T20:32:17.2871815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.2872538Z 
2025-05-07T20:32:17.2872712Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.2873177Z     self=<...>,
2025-05-07T20:32:17.2873588Z     T=16384,
2025-05-07T20:32:17.2873789Z     D=5120,
2025-05-07T20:32:17.2873986Z     scale_ub=None,
2025-05-07T20:32:17.2874198Z     contiguous=True,
2025-05-07T20:32:17.2874420Z     compiled=True,
2025-05-07T20:32:17.2874627Z )
2025-05-07T20:32:17.3126653Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:17.3127900Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:17.3129239Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:17.3130348Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:17.3131461Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
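The recompile_limit warning above is a side effect of the parameter sweep, not of the fp8 failure itself: the test wraps silu_mul_quant in torch.compile, Dynamo guards on the strides of x0, and alternating contiguous and sliced inputs (row stride 5120 vs. 10240) force a fresh compile per combination until the limit of 8 is reached, after which Dynamo falls back to eager. A minimal sketch of the two knobs involved, assuming the torch._dynamo.config.recompile_limit name exactly as printed in the warning; silu_mul here is a stand-in, not fbgemm's op:

    import torch

    # Trades more compile time for fewer eager fallbacks (default 8 per the warning).
    torch._dynamo.config.recompile_limit = 16

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for fbgemm's silu_mul_quant (the real op also quantizes).
        return x0 * torch.sigmoid(x0) * x1

    # dynamic=True asks Dynamo to symbolize sizes/strides up front, which can
    # avoid one recompile per distinct (T, D, contiguity) combination.
    compiled = torch.compile(silu_mul, dynamic=True)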
2025-05-07T20:32:17.3824295Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:17.3841615Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.3841919Z moe/activation_test.py:126: 
2025-05-07T20:32:17.3842559Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.3842967Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.3843919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.3844685Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:17.3846613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:17.3849710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:17.3854108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.3862328Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.3862688Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.3862950Z E       ^
2025-05-07T20:32:17.3863416Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.3864292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
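Every example fails the same way: Triton rejects the fp8e4nv element type while lowering the kernel. fp8e4nv (the e4m3 variant, torch.float8_e4m3fn) requires compute capability 8.9 or newer (Ada/Hopper), while this job's g5.4xlarge runner carries an A10G at sm_86, where only fp8e4b15 and fp8e5 are available, exactly as the error says. A hedged sketch of the kind of device gate that avoids ever compiling an e4m3 kernel on such parts (the helper name is hypothetical, not fbgemm API):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn and needs
        # compute capability >= (8, 9); fp8e5 (torch.float8_e5m2) is the
        # usual fallback on older parts such as the A10G (sm_86) here.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2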
2025-05-07T20:32:17.3864969Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.6697035Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:17.6713439Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.6713716Z moe/activation_test.py:117: 
2025-05-07T20:32:17.6714351Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.6714639Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.6715201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:17.6716444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.6717135Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.6720191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.6728618Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.6729038Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.6729334Z E       ^
2025-05-07T20:32:17.6729801Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.6730670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
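Both call paths hit the same wall: the op under test (_fbgemm_silu_mul_quant, reached through silu_mul_quant) and the reference (_kernel_quantize_fp8_row, reached through triton_quantize_fp8_row) each quantize to e4m3. When the hardware cannot change, the usual fix on the test side is an explicit skip; a minimal sketch with unittest, where the predicate and class name are hypothetical:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Matches the error above: fp8e4nv needs an sm_89+ (Ada/Hopper) GPU.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        def test_silu_mul_quant(self) -> None:
            ...  # body as in the listing above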
2025-05-07T20:32:17.6731287Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.7208843Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:17.7228860Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.7229157Z moe/activation_test.py:126: 
2025-05-07T20:32:17.7229781Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.7230108Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.7230899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.7231653Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:17.7233587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:17.7236538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:17.7241146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.7249580Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.7249948Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.7250214Z E       ^
2025-05-07T20:32:17.7250675Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.7251547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.7252165Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:17.8410561Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:17.8422288Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.8422551Z moe/activation_test.py:117: 
2025-05-07T20:32:17.8423248Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.8423535Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.8424223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.8424905Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.8427861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.8436022Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8436376Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8436633Z E       ^
2025-05-07T20:32:17.8437092Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8437962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
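The compiled=False example above fails identically, which rules out torch.compile: the Triton JIT specializes the kernel for the current GPU at its first launch either way. For reference, the row-wise quantization the test expects can be emulated in plain PyTorch. This is a sketch assuming the dequantization convention visible in the listing (y is recovered as y_fp8.float() * y_scale[:, None]) and the e4m3fn maximum of 448; it is not fbgemm's implementation:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so that y ~= y_fp8.float() * scale[:, None].
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap extreme rows
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale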
2025-05-07T20:32:17.8438862Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.8447924Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:17.8459518Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.8459783Z moe/activation_test.py:117: 
2025-05-07T20:32:17.8460399Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.8460677Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.8461231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:17.8462449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.8463136Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.8466182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.8474331Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8474695Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8474953Z E       ^
2025-05-07T20:32:17.8475410Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8476280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
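The _fbgemm_silu_mul_quant[grid]( frame is Triton's launch syntax: subscripting a @triton.jit function with a grid returns a launcher, and compilation happens lazily inside that first call, which is why the error surfaces under jit.py's run and compile rather than at import time. A toy kernel showing the same pattern (everything here is illustrative, not fbgemm code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
        # JIT-compiled for the current GPU on first launch, like the kernels above.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

    def copy(src: torch.Tensor) -> torch.Tensor:
        dst = torch.empty_like(src)
        grid = (triton.cdiv(src.numel(), 1024),)
        _copy_kernel[grid](src, dst, src.numel(), BLOCK=1024)  # compile + launch
        return dst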
2025-05-07T20:32:17.8476897Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.9345571Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:17.9357119Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.9357378Z moe/activation_test.py:117: 
2025-05-07T20:32:17.9357991Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.9358267Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.9358967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.9359655Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.9362604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.9370893Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.9371239Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.9371496Z E       ^
2025-05-07T20:32:17.9371958Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.9372849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
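The ref_fn failures earlier route through autotuner.py because _kernel_quantize_fp8_row is autotuned: Triton benchmarks every candidate config (the timings = {config: self._bench(...)} frames), so the unsupported-dtype error fires while compiling the first candidate. The decorator pattern looks like this sketch, with configs invented purely for illustration:

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 256}, num_warps=4),
            triton.Config({"BLOCK": 1024}, num_warps=8),
        ],
        key=["n"],  # re-benchmark when this argument changes
    )
    @triton.jit
    def _scale_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)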
2025-05-07T20:32:17.9373469Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:17.9376126Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:17.9387622Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.9387931Z moe/activation_test.py:117: 
2025-05-07T20:32:17.9388541Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.9388816Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.9389548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.9390238Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.9393229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.9401307Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.9401661Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.9401912Z E       ^
2025-05-07T20:32:17.9402369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.9403416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
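Hypothesis's verbose mode prints a Trying example block for each sampled combination, and since every combination fails on this GPU the sweep adds little signal after the first hit. When chasing a single case from such a log, pinning the parameters with @example makes the repro deterministic, because explicit examples run before generated ones; a sketch with a trivial body:

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=128, D=5120)  # always exercise the combination seen in this log
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        assert T > 0 and D > 0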
2025-05-07T20:32:17.9404043Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:18.2485812Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:18.2497614Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:18.2497881Z moe/activation_test.py:117: 
2025-05-07T20:32:18.2498496Z moe/activation_test.py:115: in fn
2025-05-07T20:32:18.2498776Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:18.2499467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:18.2500153Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:18.2503191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:18.2511353Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.2511753Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.2512011Z E       ^
2025-05-07T20:32:18.2512473Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.2513342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
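The listing's input construction (sign times a clamp of the magnitude to [0.01, 2.0]) keeps every activation safely inside fp8 range, so the quantity being checked is purely silu(x0) * x1, computed in fp32 by the reference. As a sketch of that reference math using torch's built-in SiLU, which is equivalent to the x * sigmoid(x) written out in the listing:

    import torch
    import torch.nn.functional as F

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # silu(x) = x * sigmoid(x); upcast to fp32 first, as ref_fn does above.
        return F.silu(x0.float()) * x1.float()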
2025-05-07T20:32:18.2513966Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:18.2516580Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:18.2534344Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:18.2534613Z moe/activation_test.py:117: 
2025-05-07T20:32:18.2535238Z moe/activation_test.py:115: in fn
2025-05-07T20:32:18.2535520Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:18.2536131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:18.2537353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:18.2538041Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:18.2541252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:18.2549506Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.2549859Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.2550120Z E       ^
2025-05-07T20:32:18.2550586Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.2551520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
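The contiguous=False cases feed x0 = x[:, :D] straight to the op: a view whose row stride is 2*D, which is exactly the 10240-vs-5120 stride mismatch the Dynamo guard reported further up. A small demonstration of the two layouts:

    import torch

    D = 5120
    x = torch.randn(4, 2 * D)
    x0_view = x[:, :D]              # view into x: stride (10240, 1)
    x0_copy = x0_view.contiguous()  # fresh buffer: stride (5120, 1)
    print(x0_view.stride(), x0_copy.stride())  # (10240, 1) (5120, 1)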
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.2551039Z
2025-05-07T20:32:18.2551520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.2552199Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
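A note on the failure mode, with a minimal sketch (not part of the test file): Triton's fp8e4nv corresponds to torch.float8_e4m3fn, and this error is what Triton raises when the kernel is JIT-compiled on a GPU older than SM 8.9, since native e4m3 support arrived with SM 8.9 (Ada) / SM 9.0 (Hopper); Ampere-class parts such as the A10G are SM 8.6 and only offer fp8e4b15/fp8e5. The helper name supports_fp8e4nv and the skip wiring below are assumptions for illustration, not FBGEMM code.

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) compiles in Triton only on SM 8.9+ GPUs.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: skip rather than fail on pre-Ada hardware, e.g. with
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# on test_silu_mul_quant.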
2025-05-07T20:32:18.3599970Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.3600384Z self=,
2025-05-07T20:32:18.3600778Z T=1,
2025-05-07T20:32:18.3600960Z D=7168,
2025-05-07T20:32:18.3601147Z scale_ub=None,
2025-05-07T20:32:18.3601358Z contiguous=False,
2025-05-07T20:32:18.3601582Z compiled=True,
2025-05-07T20:32:18.3601784Z )
2025-05-07T20:32:18.4280519Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:18.4290674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:18.4290983Z op = silu_mul_quant
2025-05-07T20:32:18.4291224Z if compiled:
2025-05-07T20:32:18.4291465Z op = torch.compile(op)
2025-05-07T20:32:18.4291752Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:18.4292213Z y_fp8, y_scale = fn()
2025-05-07T20:32:18.4292492Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:18.4293016Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:18.4293342Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:18.4293641Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:18.4293952Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:18.4294308Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:18.4294824Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:18.4295120Z moe/activation_test.py:126:
2025-05-07T20:32:18.4295417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:18.4295753Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:18.4296070Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:18.4296861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:18.4297612Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:18.4298208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:18.4298896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:18.4299588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:18.4300321Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:18.4302548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:18.4303232Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:18.4303833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:18.4304391Z fn()
2025-05-07T20:32:18.4304901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:18.4305479Z self.fn.run(
2025-05-07T20:32:18.4305939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:18.4306467Z kernel = self.compile(
2025-05-07T20:32:18.4307008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:18.4307657Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:18.4308049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:18.4309561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:18.4310969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1fc9580>}
2025-05-07T20:32:18.4312300Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:18.4315187Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:18.4315711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:18.4316178Z module_map=module_map)
2025-05-07T20:32:18.4316540Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.4316895Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:18.4317154Z E ^
2025-05-07T20:32:18.4317615Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.4318487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.4319107Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
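For context on the second failure site: ref_fn computes SiLU(x0) * x1 in fp32 and hands the result to triton_quantize_fp8_row, whose autotuned row-wise quantization kernel hits the same architecture check. Below is a pure-PyTorch stand-in for what that row-wise quantization plausibly computes; the clamp-by-scale_ub ordering and the epsilon are assumptions, not FBGEMM's exact semantics.

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale so that each row's max magnitude maps to the fp8 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

# Dequantization then matches the test: y ~= y_fp8.to(torch.float32) * y_scale[:, None].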
2025-05-07T20:32:18.5559967Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
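The eager (compiled=False) example above shows the Triton kernel is launched directly by silu_mul_quant, so torch.compile is not a factor. A minimal repro sketch, assuming the import path shown in the tracebacks and the (x0, x1, scale_ub) calling convention used by the test:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x = torch.randn([1, 2 * 5120], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :5120].contiguous(), x[:, 5120:].contiguous()
# On a pre-SM 8.9 GPU this raises the same Triton CompilationError at JIT time.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)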
2025-05-07T20:32:18.5596889Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.8070616Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.9000936Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.9031751Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.9062803Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.0477917Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
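Across every example above the outcome is identical for compiled=True and compiled=False, which is consistent with silu_mul_quant launching the Triton kernel directly while torch.compile merely wraps the call. Only the FP8 quantization step requires SM 8.9+ hardware; the activation math itself is ordinary fp32, as in ref_fn. A standalone sketch of that portion:

import torch

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 in fp32, mirroring ref_fn in moe/activation_test.py;
    # this part runs on any device, it is the quantization that does not.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32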
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.0508537Z 2025-05-07T20:32:19.0508953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.0509478Z 2025-05-07T20:32:19.1657228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1657700Z self=, 2025-05-07T20:32:19.1658105Z T=4096, 2025-05-07T20:32:19.1658289Z D=5120, 2025-05-07T20:32:19.1658480Z scale_ub=1200.0, 2025-05-07T20:32:19.1658704Z contiguous=False, 2025-05-07T20:32:19.1658922Z compiled=False, 2025-05-07T20:32:19.1659126Z ) 2025-05-07T20:32:19.1659446Z self = 2025-05-07T20:32:19.1660195Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1660755Z 2025-05-07T20:32:19.1661114Z @given( 2025-05-07T20:32:19.1661572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1662179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1662771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1663413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1664051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1664604Z ) 2025-05-07T20:32:19.1665287Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1666159Z def test_silu_mul_quant( 2025-05-07T20:32:19.1666619Z self, 2025-05-07T20:32:19.1667000Z T: int, 2025-05-07T20:32:19.1667372Z D: int, 2025-05-07T20:32:19.1667788Z scale_ub: Optional[float], 2025-05-07T20:32:19.1668315Z contiguous: bool, 2025-05-07T20:32:19.1668778Z compiled: bool, 2025-05-07T20:32:19.1669320Z ) -> None: 2025-05-07T20:32:19.1669743Z torch.manual_seed(2025) 2025-05-07T20:32:19.1670137Z 2025-05-07T20:32:19.1670419Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1670755Z 2025-05-07T20:32:19.1670944Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1671228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1671532Z x = x_sign * x_clamp 2025-05-07T20:32:19.1671766Z x0 = x[:, :D] 2025-05-07T20:32:19.1671977Z x1 = x[:, D:] 2025-05-07T20:32:19.1672180Z 2025-05-07T20:32:19.1672368Z if contiguous: 2025-05-07T20:32:19.1672597Z x0 = x0.contiguous() 2025-05-07T20:32:19.1672855Z x1 = x1.contiguous() 2025-05-07T20:32:19.1673093Z 2025-05-07T20:32:19.1673278Z if scale_ub is not None: 2025-05-07T20:32:19.1673545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1673873Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1674181Z ) 2025-05-07T20:32:19.1674375Z else: 2025-05-07T20:32:19.1674651Z scale_ub_tensor = None 2025-05-07T20:32:19.1674904Z 2025-05-07T20:32:19.1675130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1675441Z op = silu_mul_quant 2025-05-07T20:32:19.1675694Z if compiled: 2025-05-07T20:32:19.1675964Z op = torch.compile(op) 2025-05-07T20:32:19.1676252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1676520Z 2025-05-07T20:32:19.1676710Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1676872Z 2025-05-07T20:32:19.1676969Z moe/activation_test.py:117: 2025-05-07T20:32:19.1677253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1677579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1677857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1678541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.1679233Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1679819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1680492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1681153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1681687Z kernel = self.compile( 2025-05-07T20:32:19.1682236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1682886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1683355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1683577Z 2025-05-07T20:32:19.1683789Z self = 2025-05-07T20:32:19.1684908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1686269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a329f420>} 2025-05-07T20:32:19.1687615Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1688639Z context = 2025-05-07T20:32:19.1688923Z 2025-05-07T20:32:19.1689095Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1689659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1690160Z module_map=module_map) 2025-05-07T20:32:19.1690521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1690872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1691122Z E ^ 2025-05-07T20:32:19.1691585Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1692037Z 2025-05-07T20:32:19.1692458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1692970Z 2025-05-07T20:32:19.1693074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1693480Z self=, 2025-05-07T20:32:19.1693882Z T=4096, 2025-05-07T20:32:19.1694069Z D=5120, 2025-05-07T20:32:19.1694252Z scale_ub=1200.0, 2025-05-07T20:32:19.1694477Z contiguous=False, 2025-05-07T20:32:19.1694744Z compiled=True, 2025-05-07T20:32:19.1694942Z ) 2025-05-07T20:32:19.1695257Z self = 2025-05-07T20:32:19.1695750Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1696020Z 2025-05-07T20:32:19.1696097Z @given( 2025-05-07T20:32:19.1696326Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1696640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1696943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1697263Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1697588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1697875Z ) 2025-05-07T20:32:19.1698215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1698654Z def test_silu_mul_quant( 2025-05-07T20:32:19.1698892Z self, 2025-05-07T20:32:19.1699087Z T: int, 2025-05-07T20:32:19.1699284Z D: int, 2025-05-07T20:32:19.1699502Z scale_ub: Optional[float], 2025-05-07T20:32:19.1699763Z contiguous: bool, 2025-05-07T20:32:19.1700001Z compiled: bool, 2025-05-07T20:32:19.1700216Z ) -> None: 2025-05-07T20:32:19.1700420Z torch.manual_seed(2025) 2025-05-07T20:32:19.1700661Z 2025-05-07T20:32:19.1700933Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1701268Z 2025-05-07T20:32:19.1701461Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1701749Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1702052Z x = x_sign * x_clamp 2025-05-07T20:32:19.1702283Z x0 = x[:, :D] 2025-05-07T20:32:19.1702498Z x1 = x[:, D:] 2025-05-07T20:32:19.1702700Z 2025-05-07T20:32:19.1702881Z if contiguous: 2025-05-07T20:32:19.1703114Z x0 = x0.contiguous() 2025-05-07T20:32:19.1703419Z x1 = x1.contiguous() 2025-05-07T20:32:19.1703661Z 2025-05-07T20:32:19.1703844Z if scale_ub is not None: 2025-05-07T20:32:19.1704115Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1704442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1704750Z ) 2025-05-07T20:32:19.1704938Z else: 2025-05-07T20:32:19.1705142Z scale_ub_tensor = None 2025-05-07T20:32:19.1705392Z 2025-05-07T20:32:19.1705621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1705928Z op = silu_mul_quant 2025-05-07T20:32:19.1706173Z if compiled: 2025-05-07T20:32:19.1706412Z op = torch.compile(op) 2025-05-07T20:32:19.1706705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1706971Z 2025-05-07T20:32:19.1707162Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1707323Z 2025-05-07T20:32:19.1707473Z moe/activation_test.py:117: 2025-05-07T20:32:19.1707768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1708135Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1708418Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1708969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1709529Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1710235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1710923Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1711449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1712133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1712798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1713374Z kernel = self.compile( 2025-05-07T20:32:19.1713919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1714579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1714973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1715201Z 2025-05-07T20:32:19.1715406Z self = 2025-05-07T20:32:19.1716478Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1717840Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf0860>} 2025-05-07T20:32:19.1719185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1720211Z context = 2025-05-07T20:32:19.1720494Z 2025-05-07T20:32:19.1720661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1721182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1721643Z module_map=module_map) 2025-05-07T20:32:19.1722000Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1722355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1722610Z E ^ 2025-05-07T20:32:19.1723077Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1723680Z 2025-05-07T20:32:19.1724102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1724620Z 2025-05-07T20:32:19.2601480Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.2602350Z self=, 2025-05-07T20:32:19.2603152Z T=2048, 2025-05-07T20:32:19.2605435Z D=7168, 2025-05-07T20:32:19.2605800Z scale_ub=1200.0, 2025-05-07T20:32:19.2606230Z contiguous=False, 2025-05-07T20:32:19.2606663Z compiled=False, 2025-05-07T20:32:19.2607065Z ) 2025-05-07T20:32:19.2607694Z self = 2025-05-07T20:32:19.2608675Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.2609220Z 2025-05-07T20:32:19.2609371Z @given( 2025-05-07T20:32:19.2609968Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.2610377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.2610680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.2611005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.2611335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.2611615Z ) 2025-05-07T20:32:19.2611962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.2612401Z def test_silu_mul_quant( 2025-05-07T20:32:19.2612635Z self, 2025-05-07T20:32:19.2612828Z T: int, 2025-05-07T20:32:19.2613023Z D: int, 2025-05-07T20:32:19.2613242Z scale_ub: Optional[float], 2025-05-07T20:32:19.2614957Z contiguous: bool, 2025-05-07T20:32:19.2615189Z compiled: bool, 2025-05-07T20:32:19.2615409Z ) -> None: 2025-05-07T20:32:19.2615611Z torch.manual_seed(2025) 2025-05-07T20:32:19.2615843Z 2025-05-07T20:32:19.2616120Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.2616528Z 2025-05-07T20:32:19.2616716Z x_sign = torch.sign(x) 2025-05-07T20:32:19.2617002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.2617301Z x = x_sign * x_clamp 2025-05-07T20:32:19.2617536Z x0 = x[:, :D] 2025-05-07T20:32:19.2617745Z x1 = x[:, D:] 2025-05-07T20:32:19.2617944Z 2025-05-07T20:32:19.2618124Z if contiguous: 2025-05-07T20:32:19.2618354Z x0 = x0.contiguous() 2025-05-07T20:32:19.2618602Z x1 = x1.contiguous() 2025-05-07T20:32:19.2618836Z 2025-05-07T20:32:19.2619031Z if scale_ub is not None: 2025-05-07T20:32:19.2619292Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.2619627Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.2619925Z ) 2025-05-07T20:32:19.2620112Z else: 2025-05-07T20:32:19.2620316Z scale_ub_tensor = None 2025-05-07T20:32:19.2620563Z 2025-05-07T20:32:19.2620800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.2621108Z op = silu_mul_quant 2025-05-07T20:32:19.2621354Z if compiled: 2025-05-07T20:32:19.2621595Z op = torch.compile(op) 2025-05-07T20:32:19.2621883Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2622156Z 2025-05-07T20:32:19.2622350Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.2622510Z 2025-05-07T20:32:19.2622607Z moe/activation_test.py:117: 2025-05-07T20:32:19.2622896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2623229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.2623506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2624190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.2624909Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.2625523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.2626211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.2626868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.2627398Z kernel = self.compile( 2025-05-07T20:32:19.2627943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.2635092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.2635528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2635756Z 2025-05-07T20:32:19.2635967Z self = 2025-05-07T20:32:19.2637138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.2638722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf16c0>} 2025-05-07T20:32:19.2640122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.2641143Z context = 2025-05-07T20:32:19.2641427Z 2025-05-07T20:32:19.2641595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.2642116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.2642592Z module_map=module_map) 2025-05-07T20:32:19.2643045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.2643526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.2643785Z E ^ 2025-05-07T20:32:19.2644246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.2644697Z 2025-05-07T20:32:19.2645117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.2645642Z 2025-05-07T20:32:19.2645744Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.2646154Z self=, 2025-05-07T20:32:19.2646542Z T=1, 2025-05-07T20:32:19.2646718Z D=7168, 2025-05-07T20:32:19.2646906Z scale_ub=None, 2025-05-07T20:32:19.2647116Z contiguous=True, 2025-05-07T20:32:19.2647329Z compiled=False, 2025-05-07T20:32:19.2647532Z ) 2025-05-07T20:32:19.2647853Z self = 2025-05-07T20:32:19.2648336Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:19.2648597Z 2025-05-07T20:32:19.2648673Z @given( 2025-05-07T20:32:19.2648903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.2649210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.2649509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.2649826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.2650141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.2650420Z ) 2025-05-07T20:32:19.2650764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.2651208Z def test_silu_mul_quant( 2025-05-07T20:32:19.2651439Z self, 2025-05-07T20:32:19.2651626Z T: int, 2025-05-07T20:32:19.2651819Z D: int, 2025-05-07T20:32:19.2652026Z scale_ub: Optional[float], 2025-05-07T20:32:19.2652367Z contiguous: bool, 2025-05-07T20:32:19.2652603Z compiled: bool, 2025-05-07T20:32:19.2652817Z ) -> None: 2025-05-07T20:32:19.2653032Z torch.manual_seed(2025) 2025-05-07T20:32:19.2653269Z 2025-05-07T20:32:19.2653532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.2653872Z 2025-05-07T20:32:19.2654059Z x_sign = torch.sign(x) 2025-05-07T20:32:19.2654341Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.2654642Z x = x_sign * x_clamp 2025-05-07T20:32:19.2654877Z x0 = x[:, :D] 2025-05-07T20:32:19.2655082Z x1 = x[:, D:] 2025-05-07T20:32:19.2655283Z 2025-05-07T20:32:19.2655468Z if contiguous: 2025-05-07T20:32:19.2655701Z x0 = x0.contiguous() 2025-05-07T20:32:19.2655948Z x1 = x1.contiguous() 2025-05-07T20:32:19.2656181Z 2025-05-07T20:32:19.2656448Z if scale_ub is not None: 2025-05-07T20:32:19.2656766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.2657096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.2657400Z ) 2025-05-07T20:32:19.2657583Z else: 2025-05-07T20:32:19.2657789Z scale_ub_tensor = None 2025-05-07T20:32:19.2658036Z 2025-05-07T20:32:19.2658257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.2658562Z op = silu_mul_quant 2025-05-07T20:32:19.2658809Z if compiled: 2025-05-07T20:32:19.2659045Z op = torch.compile(op) 2025-05-07T20:32:19.2659341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2659617Z 2025-05-07T20:32:19.2659805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.2659976Z 2025-05-07T20:32:19.2660093Z moe/activation_test.py:117: 2025-05-07T20:32:19.2660408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2660735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.2661067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2661753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.2662440Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.2662974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.2663655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.2664311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.2664839Z kernel = self.compile( 2025-05-07T20:32:19.2665378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.2666032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.2666423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2666655Z 2025-05-07T20:32:19.2666859Z self = 2025-05-07T20:32:19.2667934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.2669291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf0fe0>} 2025-05-07T20:32:19.2670626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.2671647Z context = 2025-05-07T20:32:19.2671982Z 2025-05-07T20:32:19.2672148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.2672663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.2673118Z module_map=module_map) 2025-05-07T20:32:19.2673476Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.2673828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.2674077Z E ^ 2025-05-07T20:32:19.2674533Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.2674988Z 2025-05-07T20:32:19.2675405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.2675919Z 2025-05-07T20:32:19.2676024Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.2676472Z self=, 2025-05-07T20:32:19.2676908Z T=16384, 2025-05-07T20:32:19.2677090Z D=7168, 2025-05-07T20:32:19.2677272Z scale_ub=1200.0, 2025-05-07T20:32:19.2677490Z contiguous=False, 2025-05-07T20:32:19.2677710Z compiled=True, 2025-05-07T20:32:19.6324326Z ) 2025-05-07T20:32:19.6325136Z self = 2025-05-07T20:32:19.6325693Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.6325982Z 2025-05-07T20:32:19.6326066Z @given( 2025-05-07T20:32:19.6326318Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6326639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6326954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6327287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6327645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6327941Z ) 2025-05-07T20:32:19.6328311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6329099Z def test_silu_mul_quant( 2025-05-07T20:32:19.6329349Z self, 2025-05-07T20:32:19.6329544Z T: int, 2025-05-07T20:32:19.6329752Z D: int, 2025-05-07T20:32:19.6329982Z scale_ub: Optional[float], 2025-05-07T20:32:19.6330253Z contiguous: bool, 2025-05-07T20:32:19.6330500Z compiled: bool, 2025-05-07T20:32:19.6330739Z ) -> None: 2025-05-07T20:32:19.6330955Z torch.manual_seed(2025) 2025-05-07T20:32:19.6331203Z 2025-05-07T20:32:19.6331487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6331840Z 2025-05-07T20:32:19.6332033Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6332374Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6332804Z x = x_sign * x_clamp 2025-05-07T20:32:19.6333049Z x0 = x[:, :D] 2025-05-07T20:32:19.6333280Z x1 = x[:, D:] 2025-05-07T20:32:19.6333497Z 2025-05-07T20:32:19.6333686Z if contiguous: 2025-05-07T20:32:19.6333936Z x0 = x0.contiguous() 2025-05-07T20:32:19.6334204Z x1 = x1.contiguous() 2025-05-07T20:32:19.6334446Z 2025-05-07T20:32:19.6334652Z if scale_ub is not None: 2025-05-07T20:32:19.6334930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6335268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6335589Z ) 2025-05-07T20:32:19.6335787Z else: 2025-05-07T20:32:19.6335997Z scale_ub_tensor = None 2025-05-07T20:32:19.6336259Z 2025-05-07T20:32:19.6336506Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6336820Z op = silu_mul_quant 2025-05-07T20:32:19.6337080Z if compiled: 2025-05-07T20:32:19.6337341Z op = torch.compile(op) 2025-05-07T20:32:19.6337650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6338037Z 2025-05-07T20:32:19.6338241Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.6338755Z 2025-05-07T20:32:19.6338879Z moe/activation_test.py:117: 2025-05-07T20:32:19.6339178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6339524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.6339857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6340472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.6341043Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.6341724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.6342434Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.6343099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6344130Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6344840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6345392Z kernel = self.compile( 2025-05-07T20:32:19.6345954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6346627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6347037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6347270Z 2025-05-07T20:32:19.6347491Z self = 2025-05-07T20:32:19.6348582Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6350182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf3b00>} 2025-05-07T20:32:19.6351549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6352624Z context = 2025-05-07T20:32:19.6352916Z 2025-05-07T20:32:19.6353089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6353628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6354112Z module_map=module_map) 2025-05-07T20:32:19.6354490Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6354953Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.6355311Z E ^ 2025-05-07T20:32:19.6355792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6356249Z 2025-05-07T20:32:19.6356676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.6357211Z 2025-05-07T20:32:19.6357320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6357743Z self=, 2025-05-07T20:32:19.6358157Z T=1, 2025-05-07T20:32:19.6358343Z D=7168, 2025-05-07T20:32:19.6358548Z scale_ub=None, 2025-05-07T20:32:19.6358776Z contiguous=False, 2025-05-07T20:32:19.6358997Z compiled=False, 2025-05-07T20:32:19.6359219Z ) 2025-05-07T20:32:19.6359549Z self = 2025-05-07T20:32:19.6360123Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:19.6360408Z 2025-05-07T20:32:19.6360486Z @given( 2025-05-07T20:32:19.6360724Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6361043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6361348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6361682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6362016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6362299Z ) 2025-05-07T20:32:19.6362654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6363106Z def test_silu_mul_quant( 2025-05-07T20:32:19.6363515Z self, 2025-05-07T20:32:19.6363715Z T: int, 2025-05-07T20:32:19.6363914Z D: int, 2025-05-07T20:32:19.6364126Z scale_ub: Optional[float], 2025-05-07T20:32:19.6364406Z contiguous: bool, 2025-05-07T20:32:19.6364701Z compiled: bool, 2025-05-07T20:32:19.6364965Z ) -> None: 2025-05-07T20:32:19.6365195Z torch.manual_seed(2025) 2025-05-07T20:32:19.6365439Z 2025-05-07T20:32:19.6365723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6366069Z 2025-05-07T20:32:19.6366267Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6366564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6366876Z x = x_sign * x_clamp 2025-05-07T20:32:19.6367130Z x0 = x[:, :D] 2025-05-07T20:32:19.6367359Z x1 = x[:, D:] 2025-05-07T20:32:19.6367567Z 2025-05-07T20:32:19.6367758Z if contiguous: 2025-05-07T20:32:19.6368001Z x0 = x0.contiguous() 2025-05-07T20:32:19.6368259Z x1 = x1.contiguous() 2025-05-07T20:32:19.6368505Z 2025-05-07T20:32:19.6368707Z if scale_ub is not None: 2025-05-07T20:32:19.6368979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6369325Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6369720Z ) 2025-05-07T20:32:19.6369937Z else: 2025-05-07T20:32:19.6370153Z scale_ub_tensor = None 2025-05-07T20:32:19.6370411Z 2025-05-07T20:32:19.6370650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6370966Z op = silu_mul_quant 2025-05-07T20:32:19.6371221Z if compiled: 2025-05-07T20:32:19.6371474Z op = torch.compile(op) 2025-05-07T20:32:19.6371772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6372060Z 2025-05-07T20:32:19.6372259Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.6372423Z 2025-05-07T20:32:19.6372521Z moe/activation_test.py:117: 2025-05-07T20:32:19.6372819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6373157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.6373437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6374137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.6374850Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.6375395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6376088Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6376765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6377308Z kernel = self.compile( 2025-05-07T20:32:19.6377867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6378530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6378935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6379167Z 2025-05-07T20:32:19.6379437Z self = 2025-05-07T20:32:19.6380573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6381951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a268c9a0>} 2025-05-07T20:32:19.6383306Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6384349Z context = 2025-05-07T20:32:19.6384639Z 2025-05-07T20:32:19.6384856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6385426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6385903Z module_map=module_map) 2025-05-07T20:32:19.6386275Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6386630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.6386899Z E ^ 2025-05-07T20:32:19.6387370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6387828Z 2025-05-07T20:32:19.6388258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.6388778Z 2025-05-07T20:32:19.6388880Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6389301Z self=, 2025-05-07T20:32:19.6389716Z T=2048, 2025-05-07T20:32:19.6389913Z D=7168, 2025-05-07T20:32:19.6390111Z scale_ub=None, 2025-05-07T20:32:19.6390380Z contiguous=False, 2025-05-07T20:32:19.6390619Z compiled=True, 2025-05-07T20:32:19.6390818Z ) 2025-05-07T20:32:19.7082781Z self = 2025-05-07T20:32:19.7083513Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.7083798Z 2025-05-07T20:32:19.7083880Z @given( 2025-05-07T20:32:19.7084121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.7084433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.7084744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.7085081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.7085406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.7085705Z ) 2025-05-07T20:32:19.7086081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.7086546Z def test_silu_mul_quant( 2025-05-07T20:32:19.7086797Z self, 2025-05-07T20:32:19.7087001Z T: int, 2025-05-07T20:32:19.7087209Z D: int, 2025-05-07T20:32:19.7087424Z scale_ub: Optional[float], 2025-05-07T20:32:19.7087701Z contiguous: bool, 2025-05-07T20:32:19.7087944Z compiled: bool, 2025-05-07T20:32:19.7088166Z ) -> None: 2025-05-07T20:32:19.7088389Z torch.manual_seed(2025) 2025-05-07T20:32:19.7088635Z 2025-05-07T20:32:19.7088910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.7089256Z 2025-05-07T20:32:19.7089459Z x_sign = torch.sign(x) 2025-05-07T20:32:19.7089747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.7090065Z x = x_sign * x_clamp 2025-05-07T20:32:19.7090310Z x0 = x[:, :D] 2025-05-07T20:32:19.7090527Z x1 = x[:, D:] 2025-05-07T20:32:19.7090740Z 2025-05-07T20:32:19.7090945Z if contiguous: 2025-05-07T20:32:19.7091462Z x0 = x0.contiguous() 2025-05-07T20:32:19.7091747Z x1 = x1.contiguous() 2025-05-07T20:32:19.7092007Z 2025-05-07T20:32:19.7092200Z if scale_ub is not None: 2025-05-07T20:32:19.7092485Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.7092833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.7093151Z ) 2025-05-07T20:32:19.7093345Z else: 2025-05-07T20:32:19.7093566Z scale_ub_tensor = None 2025-05-07T20:32:19.7093828Z 2025-05-07T20:32:19.7094062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.7094385Z op = silu_mul_quant 2025-05-07T20:32:19.7094644Z if compiled: 2025-05-07T20:32:19.7094891Z op = torch.compile(op) 2025-05-07T20:32:19.7095193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7095479Z 2025-05-07T20:32:19.7095795Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.7095972Z 2025-05-07T20:32:19.7096158Z moe/activation_test.py:117: 2025-05-07T20:32:19.7096467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7096814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.7097098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7097666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.7098248Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.7098909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.7099610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.7100159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.7100856Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.7101526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.7102160Z kernel = self.compile( 2025-05-07T20:32:19.7102712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.7103374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.7103776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7104015Z 2025-05-07T20:32:19.7104223Z self = 2025-05-07T20:32:19.7105303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.7106677Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a268dd00>} 2025-05-07T20:32:19.7108026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.7109054Z context = 2025-05-07T20:32:19.7109340Z 2025-05-07T20:32:19.7109515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.7110045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.7110513Z module_map=module_map) 2025-05-07T20:32:19.7110884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.7111252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.7111513Z E ^ 2025-05-07T20:32:19.7112031Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.7112491Z 2025-05-07T20:32:19.7112921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.7113443Z 2025-05-07T20:32:19.7113556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.7113971Z self=, 2025-05-07T20:32:19.7114380Z T=4096, 2025-05-07T20:32:19.7114579Z D=7168, 2025-05-07T20:32:19.7114772Z scale_ub=None, 2025-05-07T20:32:19.7114998Z contiguous=False, 2025-05-07T20:32:19.7115232Z compiled=True, 2025-05-07T20:32:19.7115439Z ) 2025-05-07T20:32:19.7115763Z self = 2025-05-07T20:32:19.7116264Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.7116539Z 2025-05-07T20:32:19.7116672Z @given( 2025-05-07T20:32:19.7116941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.7117264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.7117577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.7117906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.7118241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.7118550Z ) 2025-05-07T20:32:19.7118896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.7119348Z def test_silu_mul_quant( 2025-05-07T20:32:19.7119600Z self, 2025-05-07T20:32:19.7119799Z T: int, 2025-05-07T20:32:19.7120006Z D: int, 2025-05-07T20:32:19.7120226Z scale_ub: Optional[float], 2025-05-07T20:32:19.7120494Z contiguous: bool, 2025-05-07T20:32:19.7129828Z compiled: bool, 2025-05-07T20:32:19.7130122Z ) -> None: 2025-05-07T20:32:19.7130351Z torch.manual_seed(2025) 2025-05-07T20:32:19.7130610Z 2025-05-07T20:32:19.7130906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.7131329Z 2025-05-07T20:32:19.7131540Z x_sign = torch.sign(x) 2025-05-07T20:32:19.7131849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.7132160Z x = x_sign * x_clamp 2025-05-07T20:32:19.7132412Z x0 = x[:, :D] 2025-05-07T20:32:19.7132641Z x1 = x[:, D:] 2025-05-07T20:32:19.7132847Z 2025-05-07T20:32:19.7133049Z if contiguous: 2025-05-07T20:32:19.7133294Z x0 = x0.contiguous() 2025-05-07T20:32:19.7133555Z x1 = x1.contiguous() 2025-05-07T20:32:19.7133808Z 2025-05-07T20:32:19.7134012Z if scale_ub is not None: 2025-05-07T20:32:19.7134287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.7134636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.7134957Z ) 2025-05-07T20:32:19.7135168Z else: 2025-05-07T20:32:19.7135384Z scale_ub_tensor = None 2025-05-07T20:32:19.7135656Z 2025-05-07T20:32:19.7135903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.7136222Z op = silu_mul_quant 2025-05-07T20:32:19.7136487Z if compiled: 2025-05-07T20:32:19.7136742Z op = torch.compile(op) 2025-05-07T20:32:19.7137037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7137321Z 2025-05-07T20:32:19.7137523Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.7137687Z 2025-05-07T20:32:19.7137789Z moe/activation_test.py:117: 2025-05-07T20:32:19.7138095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7138784Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.7139100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7139692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.7140294Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.7141062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.7141907Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.7142557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.7143393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.7144207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.7144852Z kernel = self.compile( 2025-05-07T20:32:19.7145511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.7146319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.7146855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7147191Z 2025-05-07T20:32:19.7147432Z self = 2025-05-07T20:32:19.7148783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.7150519Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a268e840>} 2025-05-07T20:32:19.7152209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.7153480Z context = 2025-05-07T20:32:19.7153833Z 2025-05-07T20:32:19.7154028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.7154721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.7155285Z module_map=module_map) 2025-05-07T20:32:19.7155695Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.7156108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.7156404Z E ^ 2025-05-07T20:32:19.7156952Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.7157519Z 2025-05-07T20:32:19.7158031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.7158678Z 2025-05-07T20:32:19.8419436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.8419953Z self=, 2025-05-07T20:32:19.8420468Z T=16384, 2025-05-07T20:32:19.8420748Z D=5120, 2025-05-07T20:32:19.8421030Z scale_ub=1200.0, 2025-05-07T20:32:19.8421349Z contiguous=False, 2025-05-07T20:32:19.8421640Z compiled=False, 2025-05-07T20:32:19.8421864Z ) 2025-05-07T20:32:19.8422199Z self = 2025-05-07T20:32:19.8422722Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.8423009Z 2025-05-07T20:32:19.8423095Z @given( 2025-05-07T20:32:19.8423342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.8423673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.8423988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.8424328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.8424657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.8424954Z ) 2025-05-07T20:32:19.8425617Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.8426069Z def test_silu_mul_quant( 2025-05-07T20:32:19.8426319Z self, 2025-05-07T20:32:19.8426523Z T: int, 2025-05-07T20:32:19.8426721Z D: int, 2025-05-07T20:32:19.8426947Z scale_ub: Optional[float], 2025-05-07T20:32:19.8427228Z contiguous: bool, 2025-05-07T20:32:19.8427467Z compiled: bool, 2025-05-07T20:32:19.8427705Z ) -> None: 2025-05-07T20:32:19.8427929Z torch.manual_seed(2025) 2025-05-07T20:32:19.8428170Z 2025-05-07T20:32:19.8428458Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.8428810Z 2025-05-07T20:32:19.8429005Z x_sign = torch.sign(x) 2025-05-07T20:32:19.8429307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.8429631Z x = x_sign * x_clamp 2025-05-07T20:32:19.8429905Z x0 = x[:, :D] 2025-05-07T20:32:19.8430234Z x1 = x[:, D:] 2025-05-07T20:32:19.8430459Z 2025-05-07T20:32:19.8430749Z if contiguous: 2025-05-07T20:32:19.8430986Z x0 = x0.contiguous() 2025-05-07T20:32:19.8431252Z x1 = x1.contiguous() 2025-05-07T20:32:19.8431504Z 2025-05-07T20:32:19.8431700Z if scale_ub is not None: 2025-05-07T20:32:19.8431982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.8432326Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.8432635Z ) 2025-05-07T20:32:19.8432841Z else: 2025-05-07T20:32:19.8433072Z scale_ub_tensor = None 2025-05-07T20:32:19.8433329Z 2025-05-07T20:32:19.8433577Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.8433906Z op = silu_mul_quant 2025-05-07T20:32:19.8434163Z if compiled: 2025-05-07T20:32:19.8434424Z op = torch.compile(op) 2025-05-07T20:32:19.8434731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8435020Z 2025-05-07T20:32:19.8435219Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.8435487Z 2025-05-07T20:32:19.8435593Z moe/activation_test.py:117: 2025-05-07T20:32:19.8435899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8436232Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.8436530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8437233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.8437922Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.8438672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.8439366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.8440043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.8440575Z kernel = self.compile( 2025-05-07T20:32:19.8441132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.8441797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.8442194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8442423Z 2025-05-07T20:32:19.8442632Z self = 2025-05-07T20:32:19.8443812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.8445202Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2784040>} 2025-05-07T20:32:19.8446623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.8447655Z context = 2025-05-07T20:32:19.8447947Z 2025-05-07T20:32:19.8448115Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.8448644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.8449117Z module_map=module_map) 2025-05-07T20:32:19.8449486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.8449894Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.8450166Z E ^ 2025-05-07T20:32:19.8450692Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.8451155Z 2025-05-07T20:32:19.8452270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.8452795Z 2025-05-07T20:32:19.8452901Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.8453317Z self=, 2025-05-07T20:32:19.8453720Z T=16384, 2025-05-07T20:32:19.8453915Z D=5120, 2025-05-07T20:32:19.8454120Z scale_ub=1200.0, 2025-05-07T20:32:19.8454339Z contiguous=True, 2025-05-07T20:32:19.8454565Z compiled=True, 2025-05-07T20:32:19.8454786Z ) 2025-05-07T20:32:19.8455103Z self = 2025-05-07T20:32:19.8455607Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.8455892Z 2025-05-07T20:32:19.8455974Z @given( 2025-05-07T20:32:19.8456209Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.8456519Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.8456907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.8457235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.8457559Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.8457851Z ) 2025-05-07T20:32:19.8458206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.8458644Z def test_silu_mul_quant( 2025-05-07T20:32:19.8458889Z self, 2025-05-07T20:32:19.8459089Z T: int, 2025-05-07T20:32:19.8459283Z D: int, 2025-05-07T20:32:19.8459511Z scale_ub: Optional[float], 2025-05-07T20:32:19.8459786Z contiguous: bool, 2025-05-07T20:32:19.8460030Z compiled: bool, 2025-05-07T20:32:19.8460252Z ) -> None: 2025-05-07T20:32:19.8460475Z torch.manual_seed(2025) 2025-05-07T20:32:19.8460719Z 2025-05-07T20:32:19.8460995Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.8461339Z 2025-05-07T20:32:19.8461543Z x_sign = torch.sign(x) 2025-05-07T20:32:19.8461851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.8462172Z x = x_sign * x_clamp 2025-05-07T20:32:19.8462414Z x0 = x[:, :D] 2025-05-07T20:32:19.8462639Z x1 = x[:, D:] 2025-05-07T20:32:19.8462843Z 2025-05-07T20:32:19.8463034Z if contiguous: 2025-05-07T20:32:19.8463270Z x0 = x0.contiguous() 2025-05-07T20:32:19.8463524Z x1 = x1.contiguous() 2025-05-07T20:32:19.8463764Z 2025-05-07T20:32:19.8463963Z if scale_ub is not None: 2025-05-07T20:32:19.8464230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.8464564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.8464888Z ) 2025-05-07T20:32:19.8465089Z else: 2025-05-07T20:32:19.8465305Z scale_ub_tensor = None 2025-05-07T20:32:19.8465565Z 2025-05-07T20:32:19.8465841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.8466171Z op = silu_mul_quant 2025-05-07T20:32:19.8466424Z if compiled: 2025-05-07T20:32:19.8466668Z op = torch.compile(op) 2025-05-07T20:32:19.8466968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8467246Z 2025-05-07T20:32:19.8467437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.8467609Z 2025-05-07T20:32:19.8467710Z moe/activation_test.py:117: 2025-05-07T20:32:19.8468007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8468335Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.8468624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8469186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.8469751Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.8470488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.8471225Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.8471766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.8472444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.8473110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.8473647Z kernel = self.compile( 2025-05-07T20:32:19.8474191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.8474847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.8475248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8475474Z 2025-05-07T20:32:19.8475692Z self = 2025-05-07T20:32:19.8476813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.8478173Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2785300>} 2025-05-07T20:32:19.8479518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.8480546Z context = 2025-05-07T20:32:19.8480834Z 2025-05-07T20:32:19.8481010Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.8481531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.8482007Z module_map=module_map) 2025-05-07T20:32:19.8482374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.8482733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.8482987Z E ^ 2025-05-07T20:32:19.8483628Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.8484081Z 2025-05-07T20:32:19.8484505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.8485017Z 2025-05-07T20:32:20.1631156Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.1631692Z self=, 2025-05-07T20:32:20.1632135Z T=16384, 2025-05-07T20:32:20.1632364Z D=5120, 2025-05-07T20:32:20.1632567Z scale_ub=None, 2025-05-07T20:32:20.1633140Z contiguous=False, 2025-05-07T20:32:20.1633380Z compiled=True, 2025-05-07T20:32:20.1633594Z ) 2025-05-07T20:32:20.1633918Z self = 2025-05-07T20:32:20.1634411Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:20.1634696Z 2025-05-07T20:32:20.1634776Z @given( 2025-05-07T20:32:20.1635009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.1635325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.1635625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.1635953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.1636281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.1636559Z ) 2025-05-07T20:32:20.1636909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.1637446Z def test_silu_mul_quant( 2025-05-07T20:32:20.1637779Z self, 2025-05-07T20:32:20.1637980Z T: int, 2025-05-07T20:32:20.1638180Z D: int, 2025-05-07T20:32:20.1638677Z scale_ub: Optional[float], 2025-05-07T20:32:20.1638983Z contiguous: bool, 2025-05-07T20:32:20.1639245Z compiled: bool, 2025-05-07T20:32:20.1639486Z ) -> None: 2025-05-07T20:32:20.1639719Z torch.manual_seed(2025) 2025-05-07T20:32:20.1640027Z 2025-05-07T20:32:20.1640329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.1640725Z 2025-05-07T20:32:20.1640933Z x_sign = torch.sign(x) 2025-05-07T20:32:20.1641254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.1641596Z x = x_sign * x_clamp 2025-05-07T20:32:20.1641856Z x0 = x[:, :D] 2025-05-07T20:32:20.1642091Z x1 = x[:, D:] 2025-05-07T20:32:20.1642312Z 2025-05-07T20:32:20.1642511Z if contiguous: 2025-05-07T20:32:20.1642769Z x0 = x0.contiguous() 2025-05-07T20:32:20.1643057Z x1 = x1.contiguous() 2025-05-07T20:32:20.1643565Z 2025-05-07T20:32:20.1643762Z if scale_ub is not None: 2025-05-07T20:32:20.1644031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.1644374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.1644695Z ) 2025-05-07T20:32:20.1644888Z else: 2025-05-07T20:32:20.1645101Z scale_ub_tensor = None 2025-05-07T20:32:20.1645365Z 2025-05-07T20:32:20.1645602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.1645925Z op = silu_mul_quant 2025-05-07T20:32:20.1646178Z if compiled: 2025-05-07T20:32:20.1646430Z op = torch.compile(op) 2025-05-07T20:32:20.1646721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.1647001Z 2025-05-07T20:32:20.1647203Z > y_fp8, y_scale = fn() 2025-05-07T20:32:20.1647367Z 2025-05-07T20:32:20.1647471Z moe/activation_test.py:117: 2025-05-07T20:32:20.1647778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.1648118Z moe/activation_test.py:115: in fn 2025-05-07T20:32:20.1648396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.1648961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:20.1649529Z return fn(*args, **kwargs) 
2025-05-07T20:32:20.1650200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:20.1650882Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:20.1651422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.1652106Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.1652881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.1653432Z kernel = self.compile( 2025-05-07T20:32:20.1653979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.1654645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.1655039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.1655272Z 2025-05-07T20:32:20.1655480Z self = 2025-05-07T20:32:20.1656561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.1658021Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2785e40>} 2025-05-07T20:32:20.1659415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.1660490Z context = 2025-05-07T20:32:20.1660783Z 2025-05-07T20:32:20.1660950Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.1661474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.1661939Z module_map=module_map) 2025-05-07T20:32:20.1662309Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.1662670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.1662932Z E ^ 2025-05-07T20:32:20.1663401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.1663914Z 
2025-05-07T20:32:20.1664333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.1664850Z 
2025-05-07T20:32:20.1664966Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.1665379Z self=,
2025-05-07T20:32:20.1665778Z T=2048,
2025-05-07T20:32:20.1665969Z D=5120,
2025-05-07T20:32:20.1666167Z scale_ub=None,
2025-05-07T20:32:20.1666383Z contiguous=False,
2025-05-07T20:32:20.1666616Z compiled=True,
2025-05-07T20:32:20.1666826Z )
2025-05-07T20:32:20.2432295Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.2433174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.2433798Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.2434205Z self=,
2025-05-07T20:32:20.2434658Z T=2048,
2025-05-07T20:32:20.2434851Z D=5120,
2025-05-07T20:32:20.2435106Z scale_ub=1200.0,
2025-05-07T20:32:20.2435329Z contiguous=False,
2025-05-07T20:32:20.2435569Z compiled=True,
2025-05-07T20:32:20.2435786Z )
2025-05-07T20:32:20.2465144Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.2466025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.3799163Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.3800055Z self=,
2025-05-07T20:32:20.3800495Z T=4096,
2025-05-07T20:32:20.3800685Z D=5120,
2025-05-07T20:32:20.3800885Z scale_ub=1200.0,
2025-05-07T20:32:20.3801116Z contiguous=True,
2025-05-07T20:32:20.3801335Z compiled=True,
2025-05-07T20:32:20.3801549Z )
2025-05-07T20:32:20.3831494Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.3832378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.3833014Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.3833432Z self=,
2025-05-07T20:32:20.3833843Z T=128,
2025-05-07T20:32:20.3834090Z D=5120,
2025-05-07T20:32:20.3834279Z scale_ub=1200.0,
2025-05-07T20:32:20.3834505Z contiguous=False,
2025-05-07T20:32:20.3834731Z compiled=True,
2025-05-07T20:32:20.3834926Z )
2025-05-07T20:32:20.6500876Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.6501753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.6502379Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.6502795Z self=,
2025-05-07T20:32:20.6503203Z T=16384,
2025-05-07T20:32:20.6503399Z D=7168,
2025-05-07T20:32:20.6503598Z scale_ub=1200.0,
2025-05-07T20:32:20.6503832Z contiguous=True,
2025-05-07T20:32:20.6504058Z compiled=True,
2025-05-07T20:32:20.6504270Z )
2025-05-07T20:32:20.6533497Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.6534378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.7494894Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.7496230Z self=,
2025-05-07T20:32:20.7497108Z T=16384,
2025-05-07T20:32:20.7497502Z D=5120,
2025-05-07T20:32:20.7497890Z scale_ub=1200.0,
2025-05-07T20:32:20.7498330Z contiguous=True,
2025-05-07T20:32:20.7498786Z compiled=False,
2025-05-07T20:32:20.7499198Z )
2025-05-07T20:32:20.7536476Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7537397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.7538031Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.7538761Z self=,
2025-05-07T20:32:20.7539178Z T=1,
2025-05-07T20:32:20.7539368Z D=7168,
2025-05-07T20:32:20.7539568Z scale_ub=1200.0,
2025-05-07T20:32:20.7539789Z contiguous=False,
2025-05-07T20:32:20.7540023Z compiled=False,
2025-05-07T20:32:20.7540230Z )
2025-05-07T20:32:20.7567979Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7568863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.8909074Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.8909591Z self=,
2025-05-07T20:32:20.8910014Z T=4096,
2025-05-07T20:32:20.8910215Z D=7168,
2025-05-07T20:32:20.8910418Z scale_ub=1200.0,
2025-05-07T20:32:20.8910650Z contiguous=False,
2025-05-07T20:32:20.8910909Z compiled=True,
2025-05-07T20:32:20.8911130Z )
2025-05-07T20:32:20.8941085Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.8941972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.8942673Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.8943097Z self=,
2025-05-07T20:32:20.8943506Z T=128,
2025-05-07T20:32:20.8943701Z D=7168,
2025-05-07T20:32:20.8943896Z scale_ub=1200.0,
2025-05-07T20:32:20.8944134Z contiguous=False,
2025-05-07T20:32:20.8944363Z compiled=True,
2025-05-07T20:32:20.8944568Z )
2025-05-07T20:32:20.9702726Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.9703598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.9704226Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.9704647Z self=,
2025-05-07T20:32:20.9705060Z T=2048,
2025-05-07T20:32:20.9705249Z D=7168,
2025-05-07T20:32:20.9705496Z scale_ub=None,
2025-05-07T20:32:20.9705720Z contiguous=True,
2025-05-07T20:32:20.9705986Z compiled=True,
2025-05-07T20:32:20.9706200Z )
2025-05-07T20:32:20.9743221Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.9744108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
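Every example above aborts at the same point: the _fbgemm_silu_mul_quant kernel requests Triton's fp8e4nv dtype (the e4m3 variant backing torch.float8_e4m3fn), which Triton only lowers on CUDA devices of compute capability 8.9 or newer; the GPU serving this job evidently predates that, hence the identical CompilationError for all ten parameter combinations. A minimal capability guard along the following lines would skip instead of fail on such devices; the helper and its placement are a hypothetical sketch, not code from the FBGEMM test suite.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering requires SM 8.9+ (Ada/Hopper); on older
    # parts only fp8e4b15 and fp8e5 are offered, as the error reports.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):
    ...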
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.9743674Z 2025-05-07T20:32:20.9744108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.9744639Z 2025-05-07T20:32:21.0404613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0405948Z self=, 2025-05-07T20:32:21.0406752Z T=16384, 2025-05-07T20:32:21.0407149Z D=5120, 2025-05-07T20:32:21.0407537Z scale_ub=None, 2025-05-07T20:32:21.0407958Z contiguous=False, 2025-05-07T20:32:21.0408413Z compiled=False, 2025-05-07T20:32:21.0408817Z ) 2025-05-07T20:32:21.0409447Z self = 2025-05-07T20:32:21.0410240Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.0410524Z 2025-05-07T20:32:21.0410613Z @given( 2025-05-07T20:32:21.0410862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0411392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0411724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0412061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0412386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0412674Z ) 2025-05-07T20:32:21.0413023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0413461Z def test_silu_mul_quant( 2025-05-07T20:32:21.0413713Z self, 2025-05-07T20:32:21.0413916Z T: int, 2025-05-07T20:32:21.0414116Z D: int, 2025-05-07T20:32:21.0414343Z scale_ub: Optional[float], 2025-05-07T20:32:21.0414623Z contiguous: bool, 2025-05-07T20:32:21.0414865Z compiled: bool, 2025-05-07T20:32:21.0415099Z ) -> None: 2025-05-07T20:32:21.0415322Z torch.manual_seed(2025) 2025-05-07T20:32:21.0415558Z 2025-05-07T20:32:21.0415924Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0416347Z 2025-05-07T20:32:21.0416555Z x_sign = torch.sign(x) 2025-05-07T20:32:21.0416846Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.0418874Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
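The CompilationError above is a hardware limitation rather than a bug in the kernel source: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which Triton accepts on NVIDIA parts only from compute capability sm_89 (Ada) onward, and a GPU whose supported list is ('fp8e4b15', 'fp8e5'), as reported here, is a pre-sm_89 device. A minimal sketch of a capability guard that would skip these examples up front; the helper name and the decorator placement are illustrative, not part of the test file:

```python
import torch

def device_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv is torch.float8_e4m3fn, accepted by Triton on NVIDIA
    # only from sm_89 (Ada) onward; older parts expose just fp8e4b15/fp8e5,
    # which matches the ValueError in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Illustrative use on the test above:
#   @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv needs sm_89+")
#   def test_silu_mul_quant(self, ...) -> None: ...
```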
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0420762Z 2025-05-07T20:32:21.0420885Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.0421112Z 2025-05-07T20:32:21.0421222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0421647Z self=, 2025-05-07T20:32:21.0422130Z T=4096, 2025-05-07T20:32:21.0422330Z D=7168, 2025-05-07T20:32:21.0422530Z scale_ub=1200.0, 2025-05-07T20:32:21.0422757Z contiguous=True, 2025-05-07T20:32:21.0422992Z compiled=True, 2025-05-07T20:32:21.0423202Z ) 2025-05-07T20:32:21.0423526Z self = 2025-05-07T20:32:21.0424018Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.0424303Z 2025-05-07T20:32:21.0424382Z @given( 2025-05-07T20:32:21.0424616Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0424929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0425244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0425579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0425906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0426207Z ) 2025-05-07T20:32:21.0426565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0427006Z def test_silu_mul_quant( 2025-05-07T20:32:21.0427252Z self, 2025-05-07T20:32:21.0427454Z T: int, 2025-05-07T20:32:21.0427649Z D: int, 2025-05-07T20:32:21.0427872Z scale_ub: Optional[float], 2025-05-07T20:32:21.0428150Z contiguous: bool, 2025-05-07T20:32:21.0428389Z compiled: bool, 2025-05-07T20:32:21.0428616Z ) -> None: 2025-05-07T20:32:21.0428834Z torch.manual_seed(2025) 2025-05-07T20:32:21.0429078Z 2025-05-07T20:32:21.0429351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0429697Z 2025-05-07T20:32:21.0429891Z x_sign = torch.sign(x) 2025-05-07T20:32:21.0430185Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.0432228Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0434097Z 2025-05-07T20:32:21.0434217Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.0434430Z 2025-05-07T20:32:21.0434537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0434941Z self=, 2025-05-07T20:32:21.0435347Z T=16384, 2025-05-07T20:32:21.0435542Z D=7168, 2025-05-07T20:32:21.0435730Z scale_ub=None, 2025-05-07T20:32:21.0435991Z contiguous=False, 2025-05-07T20:32:21.0436226Z compiled=False, 2025-05-07T20:32:21.0436473Z ) 2025-05-07T20:32:21.0436792Z self = 2025-05-07T20:32:21.0437291Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.0437569Z 2025-05-07T20:32:21.0437654Z @given( 2025-05-07T20:32:21.0437880Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0438192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0438818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0439156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0439492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0439787Z ) 2025-05-07T20:32:21.0440134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0440624Z def test_silu_mul_quant( 2025-05-07T20:32:21.0440864Z self, 2025-05-07T20:32:21.0441068Z T: int, 2025-05-07T20:32:21.0441269Z D: int, 2025-05-07T20:32:21.0441586Z scale_ub: Optional[float], 2025-05-07T20:32:21.0441865Z contiguous: bool, 2025-05-07T20:32:21.0442101Z compiled: bool, 2025-05-07T20:32:21.0442332Z ) -> None: 2025-05-07T20:32:21.0442552Z torch.manual_seed(2025) 2025-05-07T20:32:21.0442791Z 2025-05-07T20:32:21.0443073Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0445216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0447091Z 2025-05-07T20:32:21.0447220Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.0447431Z 2025-05-07T20:32:21.0447543Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0447950Z self=, 2025-05-07T20:32:21.0448356Z T=2048, 2025-05-07T20:32:21.0448554Z D=7168, 2025-05-07T20:32:21.0448760Z scale_ub=1200.0, 2025-05-07T20:32:21.0448986Z contiguous=True, 2025-05-07T20:32:21.0449215Z compiled=True, 2025-05-07T20:32:21.0449416Z ) 2025-05-07T20:32:21.0449738Z self = 2025-05-07T20:32:21.0450234Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.0450505Z 2025-05-07T20:32:21.0450589Z @given( 2025-05-07T20:32:21.0450814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0451132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0451513Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0451842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0452176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0452467Z ) 2025-05-07T20:32:21.0452811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0453257Z def test_silu_mul_quant( 2025-05-07T20:32:21.0453502Z self, 2025-05-07T20:32:21.0453698Z T: int, 2025-05-07T20:32:21.0453893Z D: int, 2025-05-07T20:32:21.0454115Z scale_ub: Optional[float], 2025-05-07T20:32:21.0454388Z contiguous: bool, 2025-05-07T20:32:21.0454623Z compiled: bool, 2025-05-07T20:32:21.0454849Z ) -> None: 2025-05-07T20:32:21.0455066Z torch.manual_seed(2025) 2025-05-07T20:32:21.0455302Z 2025-05-07T20:32:21.0455642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0455989Z 2025-05-07T20:32:21.0456238Z x_sign = torch.sign(x) 2025-05-07T20:32:21.0456536Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.0458522Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
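The OutOfMemoryError sizes line up exactly with what the test allocates: each failed request is a single [T, 2*D] bfloat16 tensor, either the torch.randn input (activation_test.py:92) or a same-shaped intermediate from the sign/clamp lines (:94, :95), at 2 bytes per element. A quick check against the figures reported so far:

```python
def bf16_alloc_mib(T: int, D: int) -> float:
    # One [T, 2*D] bfloat16 tensor, 2 bytes per element, expressed in MiB.
    return T * (2 * D) * 2 / 2**20

assert bf16_alloc_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
assert bf16_alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert bf16_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
assert bf16_alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
```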
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0460416Z 2025-05-07T20:32:21.0460541Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.0460752Z 2025-05-07T20:32:21.0460862Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0461275Z self=, 2025-05-07T20:32:21.0461760Z T=2048, 2025-05-07T20:32:21.0461956Z D=7168, 2025-05-07T20:32:21.0462146Z scale_ub=None, 2025-05-07T20:32:21.0462365Z contiguous=True, 2025-05-07T20:32:21.0462591Z compiled=False, 2025-05-07T20:32:21.0462794Z ) 2025-05-07T20:32:21.1326519Z self = 2025-05-07T20:32:21.1327247Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.1327525Z 2025-05-07T20:32:21.1327616Z @given( 2025-05-07T20:32:21.1327857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1328175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1328497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1328836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1329168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1329480Z ) 2025-05-07T20:32:21.1329850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1330307Z def test_silu_mul_quant( 2025-05-07T20:32:21.1330555Z self, 2025-05-07T20:32:21.1330758Z T: int, 2025-05-07T20:32:21.1330958Z D: int, 2025-05-07T20:32:21.1331182Z scale_ub: Optional[float], 2025-05-07T20:32:21.1331463Z contiguous: bool, 2025-05-07T20:32:21.1331705Z compiled: bool, 2025-05-07T20:32:21.1331936Z ) -> None: 2025-05-07T20:32:21.1332159Z torch.manual_seed(2025) 2025-05-07T20:32:21.1332418Z 2025-05-07T20:32:21.1332704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1333054Z 2025-05-07T20:32:21.1333253Z > x_sign = torch.sign(x) 2025-05-07T20:32:21.1335462Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.1337331Z 2025-05-07T20:32:21.1337452Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:21.1337676Z 2025-05-07T20:32:21.1337783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1338206Z self=, 2025-05-07T20:32:21.1338910Z T=1, 2025-05-07T20:32:21.1339104Z D=7168, 2025-05-07T20:32:21.1339308Z scale_ub=1200.0, 2025-05-07T20:32:21.1339534Z contiguous=True, 2025-05-07T20:32:21.1339770Z compiled=False, 2025-05-07T20:32:21.1339983Z ) 2025-05-07T20:32:21.1340434Z self = 2025-05-07T20:32:21.1341010Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.1341281Z 2025-05-07T20:32:21.1341374Z @given( 2025-05-07T20:32:21.1341609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1341930Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1342244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1342581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1342911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1343204Z ) 2025-05-07T20:32:21.1343563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1344005Z def test_silu_mul_quant( 2025-05-07T20:32:21.1344250Z self, 2025-05-07T20:32:21.1344452Z T: int, 2025-05-07T20:32:21.1344648Z D: int, 2025-05-07T20:32:21.1344877Z scale_ub: Optional[float], 2025-05-07T20:32:21.1345154Z contiguous: bool, 2025-05-07T20:32:21.1345477Z compiled: bool, 2025-05-07T20:32:21.1345704Z ) -> None: 2025-05-07T20:32:21.1345931Z torch.manual_seed(2025) 2025-05-07T20:32:21.1346168Z 2025-05-07T20:32:21.1346449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1346801Z 2025-05-07T20:32:21.1347001Z x_sign = torch.sign(x) 2025-05-07T20:32:21.1347293Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.1347610Z x = x_sign * x_clamp 2025-05-07T20:32:21.1347854Z x0 = x[:, :D] 2025-05-07T20:32:21.1348071Z x1 = x[:, D:] 2025-05-07T20:32:21.1348286Z 2025-05-07T20:32:21.1348479Z if contiguous: 2025-05-07T20:32:21.1348714Z x0 = x0.contiguous() 2025-05-07T20:32:21.1348989Z x1 = x1.contiguous() 2025-05-07T20:32:21.1349241Z 2025-05-07T20:32:21.1349439Z if scale_ub is not None: 2025-05-07T20:32:21.1349724Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.1350077Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.1350419Z ) 2025-05-07T20:32:21.1350645Z else: 2025-05-07T20:32:21.1350864Z scale_ub_tensor = None 2025-05-07T20:32:21.1351117Z 2025-05-07T20:32:21.1351360Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.1351687Z op = silu_mul_quant 2025-05-07T20:32:21.1351947Z if compiled: 2025-05-07T20:32:21.1352202Z op = torch.compile(op) 2025-05-07T20:32:21.1352509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1352800Z 2025-05-07T20:32:21.1352996Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.1353174Z 2025-05-07T20:32:21.1353278Z moe/activation_test.py:117: 2025-05-07T20:32:21.1353582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1353925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.1354291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1355001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.1355706Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.1356251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.1356946Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.1357619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.1358156Z kernel = self.compile( 2025-05-07T20:32:21.1358713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.1359382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.1359835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1360109Z 2025-05-07T20:32:21.1360323Z self = 2025-05-07T20:32:21.1361413Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.1362781Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1644680>} 2025-05-07T20:32:21.1364235Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.1365273Z context = 2025-05-07T20:32:21.1365567Z 2025-05-07T20:32:21.1365747Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.1366324Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.1366804Z module_map=module_map) 2025-05-07T20:32:21.1367171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.1367535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.1367799Z E ^ 2025-05-07T20:32:21.1368267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.1368726Z 2025-05-07T20:32:21.1369150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.1369676Z 2025-05-07T20:32:21.1369782Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1370204Z self=, 2025-05-07T20:32:21.1370614Z T=128, 2025-05-07T20:32:21.1370813Z D=5120, 2025-05-07T20:32:21.1371016Z scale_ub=None, 2025-05-07T20:32:21.1371238Z contiguous=True, 2025-05-07T20:32:21.1371469Z compiled=False, 2025-05-07T20:32:21.1371683Z ) 2025-05-07T20:32:21.3721941Z self = 2025-05-07T20:32:21.3722788Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.3723202Z 2025-05-07T20:32:21.3723426Z @given( 2025-05-07T20:32:21.3723739Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.3724064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.3724393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.3724743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.3725081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.3725390Z ) 2025-05-07T20:32:21.3726009Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.3726487Z def test_silu_mul_quant( 2025-05-07T20:32:21.3726733Z self, 2025-05-07T20:32:21.3726937Z T: int, 2025-05-07T20:32:21.3727141Z D: int, 2025-05-07T20:32:21.3727359Z scale_ub: Optional[float], 2025-05-07T20:32:21.3727644Z contiguous: bool, 2025-05-07T20:32:21.3727893Z compiled: bool, 2025-05-07T20:32:21.3728118Z ) -> None: 2025-05-07T20:32:21.3728342Z torch.manual_seed(2025) 2025-05-07T20:32:21.3728594Z 2025-05-07T20:32:21.3728868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.3729217Z 2025-05-07T20:32:21.3729414Z x_sign = torch.sign(x) 2025-05-07T20:32:21.3729706Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.3730021Z x = x_sign * x_clamp 2025-05-07T20:32:21.3730265Z x0 = x[:, :D] 2025-05-07T20:32:21.3730595Z x1 = x[:, D:] 2025-05-07T20:32:21.3730814Z 2025-05-07T20:32:21.3731074Z if contiguous: 2025-05-07T20:32:21.3731311Z x0 = x0.contiguous() 2025-05-07T20:32:21.3731586Z x1 = x1.contiguous() 2025-05-07T20:32:21.3731834Z 2025-05-07T20:32:21.3732033Z if scale_ub is not None: 2025-05-07T20:32:21.3732307Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.3732652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.3732967Z ) 2025-05-07T20:32:21.3733159Z else: 2025-05-07T20:32:21.3733372Z scale_ub_tensor = None 2025-05-07T20:32:21.3733628Z 2025-05-07T20:32:21.3733864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.3734188Z op = silu_mul_quant 2025-05-07T20:32:21.3734445Z if compiled: 2025-05-07T20:32:21.3734694Z op = torch.compile(op) 2025-05-07T20:32:21.3734998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3735280Z 2025-05-07T20:32:21.3735478Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.3735729Z 2025-05-07T20:32:21.3735837Z moe/activation_test.py:117: 2025-05-07T20:32:21.3736143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3736486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.3736770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3737474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.3738178Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.3739038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.3739742Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.3740461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.3741023Z kernel = self.compile( 2025-05-07T20:32:21.3741579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.3742248Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.3742653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3742884Z 2025-05-07T20:32:21.3743102Z self = 2025-05-07T20:32:21.3744185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.3745573Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16458a0>} 2025-05-07T20:32:21.3747016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.3748061Z context = 2025-05-07T20:32:21.3748353Z 2025-05-07T20:32:21.3748526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.3749057Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.3749530Z module_map=module_map) 2025-05-07T20:32:21.3749900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.3750254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.3750521Z E ^ 2025-05-07T20:32:21.3751058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.3751516Z 2025-05-07T20:32:21.3751996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.3752524Z 2025-05-07T20:32:21.3752631Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.3753049Z self=, 2025-05-07T20:32:21.3753457Z T=128, 2025-05-07T20:32:21.3753647Z D=7168, 2025-05-07T20:32:21.3753848Z scale_ub=None, 2025-05-07T20:32:21.3754067Z contiguous=True, 2025-05-07T20:32:21.3754289Z compiled=False, 2025-05-07T20:32:21.3754500Z ) 2025-05-07T20:32:21.3754827Z self = 2025-05-07T20:32:21.3755319Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.3755595Z 2025-05-07T20:32:21.3755673Z @given( 2025-05-07T20:32:21.3755905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.3756224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.3756607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.3756943Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.3757277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.3757564Z ) 2025-05-07T20:32:21.3757918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.3758367Z def test_silu_mul_quant( 2025-05-07T20:32:21.3758605Z self, 2025-05-07T20:32:21.3758807Z T: int, 2025-05-07T20:32:21.3766094Z D: int, 2025-05-07T20:32:21.3766435Z scale_ub: Optional[float], 2025-05-07T20:32:21.3766719Z contiguous: bool, 2025-05-07T20:32:21.3766961Z compiled: bool, 2025-05-07T20:32:21.3767193Z ) -> None: 2025-05-07T20:32:21.3767417Z torch.manual_seed(2025) 2025-05-07T20:32:21.3767658Z 2025-05-07T20:32:21.3767946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.3768298Z 2025-05-07T20:32:21.3768501Z x_sign = torch.sign(x) 2025-05-07T20:32:21.3768801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.3769121Z x = x_sign * x_clamp 2025-05-07T20:32:21.3769357Z x0 = x[:, :D] 2025-05-07T20:32:21.3769580Z x1 = x[:, D:] 2025-05-07T20:32:21.3769800Z 2025-05-07T20:32:21.3769985Z if contiguous: 2025-05-07T20:32:21.3770246Z x0 = x0.contiguous() 2025-05-07T20:32:21.3770547Z x1 = x1.contiguous() 2025-05-07T20:32:21.3770787Z 2025-05-07T20:32:21.3770988Z if scale_ub is not None: 2025-05-07T20:32:21.3771268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.3771612Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.3771920Z ) 2025-05-07T20:32:21.3772120Z else: 2025-05-07T20:32:21.3772341Z scale_ub_tensor = None 2025-05-07T20:32:21.3772592Z 2025-05-07T20:32:21.3772916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.3773245Z op = silu_mul_quant 2025-05-07T20:32:21.3773493Z if compiled: 2025-05-07T20:32:21.3773751Z op = torch.compile(op) 2025-05-07T20:32:21.3774054Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3774332Z 2025-05-07T20:32:21.3774532Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.3774696Z 2025-05-07T20:32:21.3774848Z moe/activation_test.py:117: 2025-05-07T20:32:21.3775252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3775607Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.3775896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3776599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.3777294Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.3777901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.3778642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.3779318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.3779851Z kernel = self.compile( 2025-05-07T20:32:21.3780403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.3781067Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.3781462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3781699Z 2025-05-07T20:32:21.3781910Z self = 2025-05-07T20:32:21.3783018Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.3784437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16467a0>} 2025-05-07T20:32:21.3785796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.3786814Z context = 2025-05-07T20:32:21.3787106Z 2025-05-07T20:32:21.3787280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.3787806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.3788276Z module_map=module_map) 2025-05-07T20:32:21.3788642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.3789008Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.3789279Z E ^ 2025-05-07T20:32:21.3789740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.3790209Z 2025-05-07T20:32:21.3790632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.3791153Z 2025-05-07T20:32:21.3791255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.3791669Z self=, 2025-05-07T20:32:21.3792064Z T=2048, 2025-05-07T20:32:21.3792262Z D=7168, 2025-05-07T20:32:21.3792462Z scale_ub=1200.0, 2025-05-07T20:32:21.3792684Z contiguous=True, 2025-05-07T20:32:21.3792915Z compiled=False, 2025-05-07T20:32:21.3793131Z ) 2025-05-07T20:32:21.4459976Z self = 2025-05-07T20:32:21.4460805Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.4461198Z 2025-05-07T20:32:21.4461312Z @given( 2025-05-07T20:32:21.4461588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.4461909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.4462219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.4462547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.4462879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.4463166Z ) 2025-05-07T20:32:21.4463513Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.4463958Z def test_silu_mul_quant( 2025-05-07T20:32:21.4464195Z self, 2025-05-07T20:32:21.4464395Z T: int, 2025-05-07T20:32:21.4464594Z D: int, 2025-05-07T20:32:21.4464881Z scale_ub: Optional[float], 2025-05-07T20:32:21.4465163Z contiguous: bool, 2025-05-07T20:32:21.4465465Z compiled: bool, 2025-05-07T20:32:21.4465693Z ) -> None: 2025-05-07T20:32:21.4465903Z torch.manual_seed(2025) 2025-05-07T20:32:21.4466145Z 2025-05-07T20:32:21.4466421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.4468470Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
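For readers without the FBGEMM sources open: judging only from the op name and from the test unpacking `y_fp8, y_scale = fn()`, silu_mul_quant fuses a SiLU-gated multiply with fp8 quantization under an optional scale upper bound. A plain-eager sketch of that assumed contract (an inference from the test, not FBGEMM's actual kernel; the rowwise scaling choice in particular is a guess):

```python
import torch
import torch.nn.functional as F

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Assumed contract: y = silu(x0) * x1, quantized rowwise to fp8e4m3.
    y = F.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the scale, as scale_ub does
    y_scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```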
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.4470330Z 2025-05-07T20:32:21.4470467Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.4470748Z 2025-05-07T20:32:21.4470850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.4471261Z self=, 2025-05-07T20:32:21.4471665Z T=1, 2025-05-07T20:32:21.4471846Z D=5120, 2025-05-07T20:32:21.4472045Z scale_ub=1200.0, 2025-05-07T20:32:21.4472268Z contiguous=True, 2025-05-07T20:32:21.4472488Z compiled=False, 2025-05-07T20:32:21.4472694Z ) 2025-05-07T20:32:21.4473015Z self = 2025-05-07T20:32:21.4473501Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.4473769Z 2025-05-07T20:32:21.4473846Z @given( 2025-05-07T20:32:21.4474083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.4474397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.4474699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.4475035Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.4475369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.4475651Z ) 2025-05-07T20:32:21.4476003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.4476445Z def test_silu_mul_quant( 2025-05-07T20:32:21.4476695Z self, 2025-05-07T20:32:21.4476884Z T: int, 2025-05-07T20:32:21.4477085Z D: int, 2025-05-07T20:32:21.4477361Z scale_ub: Optional[float], 2025-05-07T20:32:21.4477706Z contiguous: bool, 2025-05-07T20:32:21.4477949Z compiled: bool, 2025-05-07T20:32:21.4478182Z ) -> None: 2025-05-07T20:32:21.4478394Z torch.manual_seed(2025) 2025-05-07T20:32:21.4478650Z 2025-05-07T20:32:21.4478941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.4479279Z 2025-05-07T20:32:21.4479482Z x_sign = torch.sign(x) 2025-05-07T20:32:21.4479968Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.4480352Z x = x_sign * x_clamp 2025-05-07T20:32:21.4480596Z x0 = x[:, :D] 2025-05-07T20:32:21.4480816Z x1 = x[:, D:] 2025-05-07T20:32:21.4481026Z 2025-05-07T20:32:21.4481216Z if contiguous: 2025-05-07T20:32:21.4481457Z x0 = x0.contiguous() 2025-05-07T20:32:21.4481711Z x1 = x1.contiguous() 2025-05-07T20:32:21.4481961Z 2025-05-07T20:32:21.4482157Z if scale_ub is not None: 2025-05-07T20:32:21.4482433Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.4482764Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.4483073Z ) 2025-05-07T20:32:21.4483269Z else: 2025-05-07T20:32:21.4483644Z scale_ub_tensor = None 2025-05-07T20:32:21.4483898Z 2025-05-07T20:32:21.4484132Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.4484490Z op = silu_mul_quant 2025-05-07T20:32:21.4484748Z if compiled: 2025-05-07T20:32:21.4485051Z op = torch.compile(op) 2025-05-07T20:32:21.4485353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.4485639Z 2025-05-07T20:32:21.4485832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.4485995Z 2025-05-07T20:32:21.4486097Z moe/activation_test.py:117: 2025-05-07T20:32:21.4486393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.4486733Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.4487014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.4487704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.4488596Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.4489242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.4489933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.4490670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.4491206Z kernel = self.compile( 2025-05-07T20:32:21.4491752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.4492407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.4492809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.4493036Z 2025-05-07T20:32:21.4493251Z self = 2025-05-07T20:32:21.4494332Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.4495688Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1647b00>} 2025-05-07T20:32:21.4497034Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.4498061Z context = 2025-05-07T20:32:21.4498349Z 2025-05-07T20:32:21.4498524Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.4499040Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.4499508Z module_map=module_map) 2025-05-07T20:32:21.4499877Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.4500249Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.4500547Z E ^ 2025-05-07T20:32:21.4501074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.4501529Z 2025-05-07T20:32:21.4501957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.4502473Z 2025-05-07T20:32:21.4502587Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.4502992Z self=, 2025-05-07T20:32:21.4503397Z T=2048, 2025-05-07T20:32:21.4503587Z D=5120, 2025-05-07T20:32:21.4503775Z scale_ub=None, 2025-05-07T20:32:21.4503992Z contiguous=True, 2025-05-07T20:32:21.4504219Z compiled=False, 2025-05-07T20:32:21.4504420Z ) 2025-05-07T20:32:21.4504743Z self = 2025-05-07T20:32:21.4505281Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.4505617Z 2025-05-07T20:32:21.4505701Z @given( 2025-05-07T20:32:21.4505934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.4506248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.4506556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.4506877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.4507206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.4507487Z ) 2025-05-07T20:32:21.4507837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.4508282Z def test_silu_mul_quant( 2025-05-07T20:32:21.4508518Z self, 2025-05-07T20:32:21.4508721Z T: int, 2025-05-07T20:32:21.4508922Z D: int, 2025-05-07T20:32:21.4509141Z scale_ub: Optional[float], 2025-05-07T20:32:21.4509407Z contiguous: bool, 2025-05-07T20:32:21.4509652Z compiled: bool, 2025-05-07T20:32:21.4509879Z ) -> None: 2025-05-07T20:32:21.4510096Z torch.manual_seed(2025) 2025-05-07T20:32:21.4510388Z 2025-05-07T20:32:21.4510673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.4511060Z 2025-05-07T20:32:21.4511260Z > x_sign = torch.sign(x) 2025-05-07T20:32:21.4513205Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
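Note that torch.compile is not a factor here: the compiled=False examples above (T=1, T=128) fail in the same place, because silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel in eager mode too, per the traceback through moe/activation.py:80. A standalone repro outside Hypothesis, assuming the import path shown in that traceback:

```python
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn([1, 5120], device="cuda", dtype=torch.bfloat16)
x1 = torch.randn([1, 5120], device="cuda", dtype=torch.bfloat16)

# On a pre-sm_89 GPU this raises triton.compiler.errors.CompilationError:
#   ValueError("type fp8e4nv not supported in this architecture. ...")
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
```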
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.4515054Z 2025-05-07T20:32:21.4515183Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:21.4515396Z 2025-05-07T20:32:21.4515513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.4515928Z self=, 2025-05-07T20:32:21.4516334Z T=16384, 2025-05-07T20:32:21.4516530Z D=5120, 2025-05-07T20:32:21.4516718Z scale_ub=None, 2025-05-07T20:32:21.4516929Z contiguous=True, 2025-05-07T20:32:21.4517154Z compiled=False, 2025-05-07T20:32:21.4517353Z ) 2025-05-07T20:32:21.5218162Z self = 2025-05-07T20:32:21.5218934Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.5219320Z 2025-05-07T20:32:21.5219442Z @given( 2025-05-07T20:32:21.5219687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5220007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5220308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5220649Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5221101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5221399Z ) 2025-05-07T20:32:21.5221746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5222193Z def test_silu_mul_quant( 2025-05-07T20:32:21.5222438Z self, 2025-05-07T20:32:21.5222632Z T: int, 2025-05-07T20:32:21.5222834Z D: int, 2025-05-07T20:32:21.5223056Z scale_ub: Optional[float], 2025-05-07T20:32:21.5223329Z contiguous: bool, 2025-05-07T20:32:21.5223572Z compiled: bool, 2025-05-07T20:32:21.5223804Z ) -> None: 2025-05-07T20:32:21.5224016Z torch.manual_seed(2025) 2025-05-07T20:32:21.5224259Z 2025-05-07T20:32:21.5224540Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5226664Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5228572Z 2025-05-07T20:32:21.5228700Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5228912Z 2025-05-07T20:32:21.5229017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5229431Z self=, 2025-05-07T20:32:21.5229838Z T=4096, 2025-05-07T20:32:21.5230025Z D=5120, 2025-05-07T20:32:21.5230222Z scale_ub=None, 2025-05-07T20:32:21.5230447Z contiguous=True, 2025-05-07T20:32:21.5230707Z compiled=False, 2025-05-07T20:32:21.5230918Z ) 2025-05-07T20:32:21.5231239Z self = 2025-05-07T20:32:21.5231802Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.5232080Z 2025-05-07T20:32:21.5232162Z @given( 2025-05-07T20:32:21.5232396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5232706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5233009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5233340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5233671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5233955Z ) 2025-05-07T20:32:21.5234307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5234755Z def test_silu_mul_quant( 2025-05-07T20:32:21.5234993Z self, 2025-05-07T20:32:21.5235194Z T: int, 2025-05-07T20:32:21.5235397Z D: int, 2025-05-07T20:32:21.5235612Z scale_ub: Optional[float], 2025-05-07T20:32:21.5235898Z contiguous: bool, 2025-05-07T20:32:21.5236142Z compiled: bool, 2025-05-07T20:32:21.5236371Z ) -> None: 2025-05-07T20:32:21.5236583Z torch.manual_seed(2025) 2025-05-07T20:32:21.5236832Z 2025-05-07T20:32:21.5237112Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5239504Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
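By this point the process holds 21.73 GiB of a 22.07 GiB card with roughly 26 MiB free, so even 40 MiB requests fail; each new Hypothesis example allocates fresh tensors while memory from earlier failed examples is still cached. Two mitigations, as sketches rather than verified fixes for this job: the allocator setting the error message itself recommends, which only takes effect if present before the process first touches CUDA (i.e. in the job's environment, not inside the test), and explicitly releasing cached blocks between examples:

```python
import gc
import os

import torch

# (1) Suggested by the error text; must be in the environment before the
#     first CUDA allocation, e.g. exported in the CI job rather than set here.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def release_cuda_memory() -> None:
    # (2) Drop dead tensors still referenced from Python, then hand the
    #     allocator's cached-but-unused blocks back to the driver.
    #     Could run in a TestCase tearDown() between examples (sketch).
    gc.collect()
    torch.cuda.empty_cache()
```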
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5241383Z 2025-05-07T20:32:21.5241592Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5241823Z 2025-05-07T20:32:21.5241930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5242351Z self=, 2025-05-07T20:32:21.5242766Z T=2048, 2025-05-07T20:32:21.5242953Z D=5120, 2025-05-07T20:32:21.5243141Z scale_ub=None, 2025-05-07T20:32:21.5243482Z contiguous=False, 2025-05-07T20:32:21.5243709Z compiled=False, 2025-05-07T20:32:21.5243916Z ) 2025-05-07T20:32:21.5244240Z self = 2025-05-07T20:32:21.5244735Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.5245014Z 2025-05-07T20:32:21.5245092Z @given( 2025-05-07T20:32:21.5245327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5245642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5246010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5246395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5246726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5247006Z ) 2025-05-07T20:32:21.5247355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5247796Z def test_silu_mul_quant( 2025-05-07T20:32:21.5248032Z self, 2025-05-07T20:32:21.5248228Z T: int, 2025-05-07T20:32:21.5248428Z D: int, 2025-05-07T20:32:21.5248641Z scale_ub: Optional[float], 2025-05-07T20:32:21.5248918Z contiguous: bool, 2025-05-07T20:32:21.5249158Z compiled: bool, 2025-05-07T20:32:21.5249378Z ) -> None: 2025-05-07T20:32:21.5249594Z torch.manual_seed(2025) 2025-05-07T20:32:21.5249837Z 2025-05-07T20:32:21.5250109Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5252150Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5254068Z 2025-05-07T20:32:21.5254185Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5254401Z 2025-05-07T20:32:21.5254507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5254917Z self=, 2025-05-07T20:32:21.5255314Z T=4096, 2025-05-07T20:32:21.5255500Z D=7168, 2025-05-07T20:32:21.5255690Z scale_ub=None, 2025-05-07T20:32:21.5255897Z contiguous=True, 2025-05-07T20:32:21.5256124Z compiled=True, 2025-05-07T20:32:21.5256332Z ) 2025-05-07T20:32:21.5256647Z self = 2025-05-07T20:32:21.5257137Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:21.5257409Z 2025-05-07T20:32:21.5257486Z @given( 2025-05-07T20:32:21.5257713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5258024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5258328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5258655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5258973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5259261Z ) 2025-05-07T20:32:21.5259608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5260044Z def test_silu_mul_quant( 2025-05-07T20:32:21.5260303Z self, 2025-05-07T20:32:21.5260539Z T: int, 2025-05-07T20:32:21.5260739Z D: int, 2025-05-07T20:32:21.5261008Z scale_ub: Optional[float], 2025-05-07T20:32:21.5261282Z contiguous: bool, 2025-05-07T20:32:21.5261518Z compiled: bool, 2025-05-07T20:32:21.5261733Z ) -> None: 2025-05-07T20:32:21.5261949Z torch.manual_seed(2025) 2025-05-07T20:32:21.5262190Z 2025-05-07T20:32:21.5262455Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5264540Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5266441Z 2025-05-07T20:32:21.5266558Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5266769Z 2025-05-07T20:32:21.5266879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5267288Z self=, 2025-05-07T20:32:21.5267683Z T=2048, 2025-05-07T20:32:21.5267870Z D=5120, 2025-05-07T20:32:21.5268061Z scale_ub=1200.0, 2025-05-07T20:32:21.5268277Z contiguous=False, 2025-05-07T20:32:21.5268502Z compiled=False, 2025-05-07T20:32:21.5268708Z ) 2025-05-07T20:32:21.5269020Z self = 2025-05-07T20:32:21.5269516Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.5269790Z 2025-05-07T20:32:21.5269871Z @given( 2025-05-07T20:32:21.5270099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5270418Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5270725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5271099Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5271420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5271707Z ) 2025-05-07T20:32:21.5272056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5272494Z def test_silu_mul_quant( 2025-05-07T20:32:21.5272732Z self, 2025-05-07T20:32:21.5272928Z T: int, 2025-05-07T20:32:21.5273120Z D: int, 2025-05-07T20:32:21.5273341Z scale_ub: Optional[float], 2025-05-07T20:32:21.5273613Z contiguous: bool, 2025-05-07T20:32:21.5273847Z compiled: bool, 2025-05-07T20:32:21.5274073Z ) -> None: 2025-05-07T20:32:21.5274292Z torch.manual_seed(2025) 2025-05-07T20:32:21.5274529Z 2025-05-07T20:32:21.5274801Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5276843Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5278702Z 2025-05-07T20:32:21.5278821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5279031Z 2025-05-07T20:32:21.5279137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5279544Z self=, 2025-05-07T20:32:21.5279948Z T=4096, 2025-05-07T20:32:21.5280138Z D=7168, 2025-05-07T20:32:21.5280329Z scale_ub=1200.0, 2025-05-07T20:32:21.5287266Z contiguous=True, 2025-05-07T20:32:21.5287538Z compiled=False, 2025-05-07T20:32:21.5287742Z ) 2025-05-07T20:32:21.6197425Z self = 2025-05-07T20:32:21.6198162Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.6198550Z 2025-05-07T20:32:21.6198660Z @given( 2025-05-07T20:32:21.6198961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6199388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6199801Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6200230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6200565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6200886Z ) 2025-05-07T20:32:21.6201259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6201823Z def test_silu_mul_quant( 2025-05-07T20:32:21.6202074Z self, 2025-05-07T20:32:21.6202342Z T: int, 2025-05-07T20:32:21.6202544Z D: int, 2025-05-07T20:32:21.6202761Z scale_ub: Optional[float], 2025-05-07T20:32:21.6203032Z contiguous: bool, 2025-05-07T20:32:21.6203269Z compiled: bool, 2025-05-07T20:32:21.6203627Z ) -> None: 2025-05-07T20:32:21.6203837Z torch.manual_seed(2025) 2025-05-07T20:32:21.6204077Z 2025-05-07T20:32:21.6204354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6206403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6208392Z 2025-05-07T20:32:21.6208510Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6208724Z 2025-05-07T20:32:21.6208830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6209249Z self=, 2025-05-07T20:32:21.6209653Z T=16384, 2025-05-07T20:32:21.6209845Z D=7168, 2025-05-07T20:32:21.6210041Z scale_ub=None, 2025-05-07T20:32:21.6210259Z contiguous=False, 2025-05-07T20:32:21.6210477Z compiled=True, 2025-05-07T20:32:21.6210681Z ) 2025-05-07T20:32:21.6210994Z self = 2025-05-07T20:32:21.6211484Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:21.6211765Z 2025-05-07T20:32:21.6211841Z @given( 2025-05-07T20:32:21.6212070Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6212386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6212693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6213021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6213342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6213621Z ) 2025-05-07T20:32:21.6213964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6214400Z def test_silu_mul_quant( 2025-05-07T20:32:21.6214633Z self, 2025-05-07T20:32:21.6214828Z T: int, 2025-05-07T20:32:21.6215026Z D: int, 2025-05-07T20:32:21.6215243Z scale_ub: Optional[float], 2025-05-07T20:32:21.6215513Z contiguous: bool, 2025-05-07T20:32:21.6215751Z compiled: bool, 2025-05-07T20:32:21.6215967Z ) -> None: 2025-05-07T20:32:21.6216176Z torch.manual_seed(2025) 2025-05-07T20:32:21.6216415Z 2025-05-07T20:32:21.6216752Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6218808Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6220667Z 2025-05-07T20:32:21.6220784Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6221005Z 2025-05-07T20:32:21.6221106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6221608Z self=, 2025-05-07T20:32:21.6222137Z T=4096, 2025-05-07T20:32:21.6222374Z D=7168, 2025-05-07T20:32:21.6222573Z scale_ub=None, 2025-05-07T20:32:21.6222785Z contiguous=True, 2025-05-07T20:32:21.6223013Z compiled=False, 2025-05-07T20:32:21.6223225Z ) 2025-05-07T20:32:21.6223537Z self = 2025-05-07T20:32:21.6224038Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.6224319Z 2025-05-07T20:32:21.6224403Z @given( 2025-05-07T20:32:21.6224633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6224937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6225243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6225571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6225894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6226191Z ) 2025-05-07T20:32:21.6226554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6226992Z def test_silu_mul_quant( 2025-05-07T20:32:21.6227290Z self, 2025-05-07T20:32:21.6227487Z T: int, 2025-05-07T20:32:21.6227679Z D: int, 2025-05-07T20:32:21.6227894Z scale_ub: Optional[float], 2025-05-07T20:32:21.6228163Z contiguous: bool, 2025-05-07T20:32:21.6228392Z compiled: bool, 2025-05-07T20:32:21.6228609Z ) -> None: 2025-05-07T20:32:21.6228820Z torch.manual_seed(2025) 2025-05-07T20:32:21.6229051Z 2025-05-07T20:32:21.6229322Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6231428Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6233303Z 2025-05-07T20:32:21.6233422Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6233630Z 2025-05-07T20:32:21.6233735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6234137Z self=, 2025-05-07T20:32:21.6234550Z T=16384, 2025-05-07T20:32:21.6234747Z D=7168, 2025-05-07T20:32:21.6234936Z scale_ub=None, 2025-05-07T20:32:21.6235145Z contiguous=True, 2025-05-07T20:32:21.6235362Z compiled=False, 2025-05-07T20:32:21.6235555Z ) 2025-05-07T20:32:21.6235867Z self = 2025-05-07T20:32:21.6236360Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.6236637Z 2025-05-07T20:32:21.6236714Z @given( 2025-05-07T20:32:21.6236994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6237311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6237614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6237939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6238261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6238921Z ) 2025-05-07T20:32:21.6239268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6239711Z def test_silu_mul_quant( 2025-05-07T20:32:21.6239955Z self, 2025-05-07T20:32:21.6240142Z T: int, 2025-05-07T20:32:21.6240338Z D: int, 2025-05-07T20:32:21.6240555Z scale_ub: Optional[float], 2025-05-07T20:32:21.6240817Z contiguous: bool, 2025-05-07T20:32:21.6241051Z compiled: bool, 2025-05-07T20:32:21.6241272Z ) -> None: 2025-05-07T20:32:21.6241564Z torch.manual_seed(2025) 2025-05-07T20:32:21.6241861Z 2025-05-07T20:32:21.6242150Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6244309Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6246186Z 2025-05-07T20:32:21.6246308Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6246517Z 2025-05-07T20:32:21.6246618Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6247026Z self=, 2025-05-07T20:32:21.6247500Z T=16384, 2025-05-07T20:32:21.6247684Z D=7168, 2025-05-07T20:32:21.6247869Z scale_ub=1200.0, 2025-05-07T20:32:21.6248086Z contiguous=True, 2025-05-07T20:32:21.6248298Z compiled=False, 2025-05-07T20:32:21.6248497Z ) 2025-05-07T20:32:21.6248808Z self = 2025-05-07T20:32:21.6249297Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.6249574Z 2025-05-07T20:32:21.6249645Z @given( 2025-05-07T20:32:21.6249867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6250174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6250467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6250787Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6251110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6251384Z ) 2025-05-07T20:32:21.6251735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6252179Z def test_silu_mul_quant( 2025-05-07T20:32:21.6252419Z self, 2025-05-07T20:32:21.6252602Z T: int, 2025-05-07T20:32:21.6252794Z D: int, 2025-05-07T20:32:21.6253008Z scale_ub: Optional[float], 2025-05-07T20:32:21.6253267Z contiguous: bool, 2025-05-07T20:32:21.6253498Z compiled: bool, 2025-05-07T20:32:21.6253717Z ) -> None: 2025-05-07T20:32:21.6253921Z torch.manual_seed(2025) 2025-05-07T20:32:21.6254155Z 2025-05-07T20:32:21.6254421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6256518Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
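[editor note] The `compiled` parameter only changes how the op is invoked: the test wraps the same op in torch.compile before calling it. A standalone sketch of that toggle, mirroring the test's fn() (the op and tensors here are illustrative stand-ins, not new API):

    import torch

    def call_op(op, x0, x1, scale_ub_tensor, compiled: bool):
        # When compiled=True, route the call through TorchDynamo/Inductor;
        # otherwise run the op eagerly, exactly as in the test's fn().
        if compiled:
            op = torch.compile(op)
        return op(x0, x1, scale_ub_tensor)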
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6258373Z 2025-05-07T20:32:21.6258488Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6258705Z 2025-05-07T20:32:21.6258805Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6259211Z self=, 2025-05-07T20:32:21.6259608Z T=128, 2025-05-07T20:32:21.6259785Z D=5120, 2025-05-07T20:32:21.6259968Z scale_ub=1200.0, 2025-05-07T20:32:21.6260184Z contiguous=False, 2025-05-07T20:32:21.6260400Z compiled=False, 2025-05-07T20:32:21.6260600Z ) 2025-05-07T20:32:21.7275138Z self = 2025-05-07T20:32:21.7276914Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.7277858Z 2025-05-07T20:32:21.7278070Z @given( 2025-05-07T20:32:21.7278629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7279264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7279870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7280402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7280734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7281023Z ) 2025-05-07T20:32:21.7281371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7281817Z def test_silu_mul_quant( 2025-05-07T20:32:21.7282062Z self, 2025-05-07T20:32:21.7282254Z T: int, 2025-05-07T20:32:21.7282451Z D: int, 2025-05-07T20:32:21.7282670Z scale_ub: Optional[float], 2025-05-07T20:32:21.7282938Z contiguous: bool, 2025-05-07T20:32:21.7283181Z compiled: bool, 2025-05-07T20:32:21.7283567Z ) -> None: 2025-05-07T20:32:21.7283865Z torch.manual_seed(2025) 2025-05-07T20:32:21.7284108Z 2025-05-07T20:32:21.7284385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7284728Z 2025-05-07T20:32:21.7284918Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7285211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7285522Z x = x_sign * x_clamp 2025-05-07T20:32:21.7285756Z x0 = x[:, :D] 2025-05-07T20:32:21.7285978Z x1 = x[:, D:] 2025-05-07T20:32:21.7286188Z 2025-05-07T20:32:21.7286384Z if contiguous: 2025-05-07T20:32:21.7286626Z x0 = x0.contiguous() 2025-05-07T20:32:21.7286884Z x1 = x1.contiguous() 2025-05-07T20:32:21.7287127Z 2025-05-07T20:32:21.7287321Z if scale_ub is not None: 2025-05-07T20:32:21.7287596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.7287934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.7288249Z ) 2025-05-07T20:32:21.7288451Z else: 2025-05-07T20:32:21.7288659Z scale_ub_tensor = None 2025-05-07T20:32:21.7288917Z 2025-05-07T20:32:21.7289153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.7289464Z op = silu_mul_quant 2025-05-07T20:32:21.7289719Z if compiled: 2025-05-07T20:32:21.7289968Z op = torch.compile(op) 2025-05-07T20:32:21.7290261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7290540Z 2025-05-07T20:32:21.7290730Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.7290900Z 2025-05-07T20:32:21.7291001Z moe/activation_test.py:117: 2025-05-07T20:32:21.7291295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7291626Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.7291909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7292674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.7293375Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.7293912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.7294591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.7295264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.7295805Z kernel = self.compile( 2025-05-07T20:32:21.7296353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.7297014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.7297415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7297687Z 2025-05-07T20:32:21.7297908Z self = 2025-05-07T20:32:21.7299031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.7300393Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146e700>} 2025-05-07T20:32:21.7301734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.7302763Z context = 2025-05-07T20:32:21.7303053Z 2025-05-07T20:32:21.7303233Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.7303757Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.7304268Z module_map=module_map) 2025-05-07T20:32:21.7304631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.7304983Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.7305233Z E ^ 2025-05-07T20:32:21.7305701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.7306155Z 2025-05-07T20:32:21.7306577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7307091Z 2025-05-07T20:32:21.7307198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7307606Z self=, 2025-05-07T20:32:21.7308011Z T=2048, 2025-05-07T20:32:21.7308201Z D=7168, 2025-05-07T20:32:21.7308393Z scale_ub=None, 2025-05-07T20:32:21.7308609Z contiguous=False, 2025-05-07T20:32:21.7308829Z compiled=False, 2025-05-07T20:32:21.7309032Z ) 2025-05-07T20:32:21.7309363Z self = 2025-05-07T20:32:21.7309856Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.7310129Z 2025-05-07T20:32:21.7310213Z @given( 2025-05-07T20:32:21.7310438Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7310751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7311058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7311380Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7311707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7311991Z ) 2025-05-07T20:32:21.7312337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7312779Z def test_silu_mul_quant( 2025-05-07T20:32:21.7313065Z self, 2025-05-07T20:32:21.7313261Z T: int, 2025-05-07T20:32:21.7313450Z D: int, 2025-05-07T20:32:21.7313669Z scale_ub: Optional[float], 2025-05-07T20:32:21.7313938Z contiguous: bool, 2025-05-07T20:32:21.7314171Z compiled: bool, 2025-05-07T20:32:21.7314389Z ) -> None: 2025-05-07T20:32:21.7314603Z torch.manual_seed(2025) 2025-05-07T20:32:21.7314841Z 2025-05-07T20:32:21.7315116Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7317220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
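[editor note] Note the free-memory figure: the allocator reports only 26.44 MiB free out of 22.07 GiB, with over 21.7 GiB already held by PyTorch, so even small follow-up examples fail. One plausible mitigation, not something activation_test.py does as shown, is to release cached allocator blocks between generated examples:

    import gc
    import torch

    def free_cuda_memory() -> None:
        # Hypothetical helper: drop dead Python references, then return the
        # allocator's reserved-but-unallocated blocks to the driver.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

Since Hypothesis runs every generated example inside a single test invocation, such cleanup would have to happen at the end of the test body itself rather than in setUp/tearDown.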
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.7319116Z 2025-05-07T20:32:21.7319239Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.7319451Z 2025-05-07T20:32:21.7319557Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7319964Z self=, 2025-05-07T20:32:21.7320365Z T=128, 2025-05-07T20:32:21.7320552Z D=7168, 2025-05-07T20:32:21.7320735Z scale_ub=1200.0, 2025-05-07T20:32:21.7320958Z contiguous=True, 2025-05-07T20:32:21.7321181Z compiled=True, 2025-05-07T20:32:21.7321382Z ) 2025-05-07T20:32:21.7623793Z self = 2025-05-07T20:32:21.7624555Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.7624947Z 2025-05-07T20:32:21.7625082Z @given( 2025-05-07T20:32:21.7625386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7625975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7626296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7626624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7626947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7627230Z ) 2025-05-07T20:32:21.7627585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7628024Z def test_silu_mul_quant( 2025-05-07T20:32:21.7628265Z self, 2025-05-07T20:32:21.7628465Z T: int, 2025-05-07T20:32:21.7628660Z D: int, 2025-05-07T20:32:21.7628884Z scale_ub: Optional[float], 2025-05-07T20:32:21.7629158Z contiguous: bool, 2025-05-07T20:32:21.7629397Z compiled: bool, 2025-05-07T20:32:21.7629632Z ) -> None: 2025-05-07T20:32:21.7629849Z torch.manual_seed(2025) 2025-05-07T20:32:21.7630088Z 2025-05-07T20:32:21.7630367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7630717Z 2025-05-07T20:32:21.7630918Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7631210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7631525Z x = x_sign * x_clamp 2025-05-07T20:32:21.7631763Z x0 = x[:, :D] 2025-05-07T20:32:21.7631972Z x1 = x[:, D:] 2025-05-07T20:32:21.7632183Z 2025-05-07T20:32:21.7632370Z if contiguous: 2025-05-07T20:32:21.7632598Z x0 = x0.contiguous() 2025-05-07T20:32:21.7632855Z x1 = x1.contiguous() 2025-05-07T20:32:21.7633094Z 2025-05-07T20:32:21.7633282Z if scale_ub is not None: 2025-05-07T20:32:21.7633569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.7633914Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.7634216Z ) 2025-05-07T20:32:21.7634408Z else: 2025-05-07T20:32:21.7634624Z scale_ub_tensor = None 2025-05-07T20:32:21.7634953Z 2025-05-07T20:32:21.7635193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.7635515Z op = silu_mul_quant 2025-05-07T20:32:21.7635765Z if compiled: 2025-05-07T20:32:21.7636011Z op = torch.compile(op) 2025-05-07T20:32:21.7636305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7636577Z 2025-05-07T20:32:21.7636763Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.7636927Z 2025-05-07T20:32:21.7637027Z moe/activation_test.py:117: 2025-05-07T20:32:21.7637323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7637651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.7637935Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7638749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:21.7639401Z return fn(*args, **kwargs) 
2025-05-07T20:32:21.7640129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.7640832Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.7641370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.7642052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.7642723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.7643266Z kernel = self.compile( 2025-05-07T20:32:21.7643951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.7644616Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.7645017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7645249Z 2025-05-07T20:32:21.7645533Z self = 2025-05-07T20:32:21.7646610Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.7647972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146ff60>} 2025-05-07T20:32:21.7649333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.7650502Z context = 2025-05-07T20:32:21.7650792Z 2025-05-07T20:32:21.7650964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.7651493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.7651956Z module_map=module_map) 2025-05-07T20:32:21.7652314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.7652665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.7652917Z E ^ 2025-05-07T20:32:21.7653377Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.7653825Z 2025-05-07T20:32:21.7654248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7654763Z 2025-05-07T20:32:21.7654893Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7655301Z self=, 2025-05-07T20:32:21.7655698Z T=128, 2025-05-07T20:32:21.7655976Z D=7168, 2025-05-07T20:32:21.7656172Z scale_ub=1200.0, 2025-05-07T20:32:21.7656392Z contiguous=True, 2025-05-07T20:32:21.7656612Z compiled=False, 2025-05-07T20:32:21.7656813Z ) 2025-05-07T20:32:21.7657122Z self = 2025-05-07T20:32:21.7657614Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.7657883Z 2025-05-07T20:32:21.7657965Z @given( 2025-05-07T20:32:21.7658187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7658500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7658800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7659122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7659448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7659734Z ) 2025-05-07T20:32:21.7660162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7660687Z def test_silu_mul_quant( 2025-05-07T20:32:21.7660923Z self, 2025-05-07T20:32:21.7661113Z T: int, 2025-05-07T20:32:21.7661300Z D: int, 2025-05-07T20:32:21.7661515Z scale_ub: Optional[float], 2025-05-07T20:32:21.7661782Z contiguous: bool, 2025-05-07T20:32:21.7662015Z compiled: bool, 2025-05-07T20:32:21.7662235Z ) -> None: 2025-05-07T20:32:21.7662445Z torch.manual_seed(2025) 2025-05-07T20:32:21.7662676Z 2025-05-07T20:32:21.7662947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7663288Z 2025-05-07T20:32:21.7669693Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7670022Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7672035Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
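[editor note] The CompilationError above is an architecture limit rather than a bug in the example: Triton only exposes the fp8e4nv (e4m3) dtype on GPUs with compute capability 8.9 or newer, and the 22 GiB capacity reported here is consistent with the A10G (sm_86) in linux.g5.4xlarge.nvidia.gpu runners. A hypothetical guard, not present in the test as shown, would skip the fp8 path on older parts:

    import unittest
    import torch

    # Assumption: Triton's fp8e4nv needs compute capability >= (8, 9) (Ada/Hopper).
    _HAS_FP8 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTestsSketch(unittest.TestCase):
        @unittest.skipIf(not _HAS_FP8, "Triton fp8e4nv unsupported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...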
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.7673978Z 2025-05-07T20:32:21.7674102Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.7674316Z 2025-05-07T20:32:21.7674420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7674836Z self=, 2025-05-07T20:32:21.7675243Z T=128, 2025-05-07T20:32:21.7675428Z D=5120, 2025-05-07T20:32:21.7675619Z scale_ub=1200.0, 2025-05-07T20:32:21.7675842Z contiguous=True, 2025-05-07T20:32:21.7676056Z compiled=True, 2025-05-07T20:32:21.7676260Z ) 2025-05-07T20:32:21.7676583Z self = 2025-05-07T20:32:21.7677072Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.7677345Z 2025-05-07T20:32:21.7677422Z @given( 2025-05-07T20:32:21.7677645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7677948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7678248Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7678579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7678904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7679181Z ) 2025-05-07T20:32:21.7679525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7679959Z def test_silu_mul_quant( 2025-05-07T20:32:21.7680193Z self, 2025-05-07T20:32:21.7680385Z T: int, 2025-05-07T20:32:21.7680585Z D: int, 2025-05-07T20:32:21.7680799Z scale_ub: Optional[float], 2025-05-07T20:32:21.7681120Z contiguous: bool, 2025-05-07T20:32:21.7681367Z compiled: bool, 2025-05-07T20:32:21.7681585Z ) -> None: 2025-05-07T20:32:21.7681799Z torch.manual_seed(2025) 2025-05-07T20:32:21.7682044Z 2025-05-07T20:32:21.7682312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7682653Z 2025-05-07T20:32:21.7682843Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7683125Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7685343Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.7687231Z 2025-05-07T20:32:21.7687347Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.7687557Z 2025-05-07T20:32:21.7687659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7688065Z self=, 2025-05-07T20:32:21.7688457Z T=128, 2025-05-07T20:32:21.7688640Z D=7168, 2025-05-07T20:32:21.7688822Z scale_ub=None, 2025-05-07T20:32:21.7689022Z contiguous=True, 2025-05-07T20:32:21.7689241Z compiled=True, 2025-05-07T20:32:21.7689438Z ) 2025-05-07T20:32:21.9743486Z self = 2025-05-07T20:32:21.9744896Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:21.9745632Z 2025-05-07T20:32:21.9745850Z @given( 2025-05-07T20:32:21.9746322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9746966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9747804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9748450Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9749101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9749667Z ) 2025-05-07T20:32:21.9750360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9750826Z def test_silu_mul_quant( 2025-05-07T20:32:21.9751070Z self, 2025-05-07T20:32:21.9751267Z T: int, 2025-05-07T20:32:21.9751462Z D: int, 2025-05-07T20:32:21.9751684Z scale_ub: Optional[float], 2025-05-07T20:32:21.9751964Z contiguous: bool, 2025-05-07T20:32:21.9752201Z compiled: bool, 2025-05-07T20:32:21.9752432Z ) -> None: 2025-05-07T20:32:21.9752652Z torch.manual_seed(2025) 2025-05-07T20:32:21.9752888Z 2025-05-07T20:32:21.9753169Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9755225Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
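[editor note] The allocator message repeatedly suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That setting must be in the environment before the process makes its first CUDA allocation; one way to arrange that in a driver script (a sketch, not part of this workflow):

    import os

    # Must be set before torch initializes its CUDA caching allocator.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var so the setting takes effect

    x = torch.randn(1024, device="cuda")  # first allocation uses expandable segments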
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9757075Z 2025-05-07T20:32:21.9757194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.9757405Z 2025-05-07T20:32:21.9762927Z FAILED 2025-05-07T20:32:21.9763264Z 2025-05-07T20:32:21.9763662Z =================================== FAILURES =================================== 2025-05-07T20:32:21.9764449Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:21.9765087Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:21.9765928Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:21.9766685Z | yield 2025-05-07T20:32:21.9767294Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:21.9768021Z | self._callTestMethod(testMethod) 2025-05-07T20:32:21.9768802Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:21.9769561Z | if method() is not None: 2025-05-07T20:32:21.9769898Z | ^^^^^^^^ 2025-05-07T20:32:21.9770929Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:21.9772022Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9772505Z | ^^^^^^^ 2025-05-07T20:32:21.9773280Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:21.9774156Z | raise the_error_hypothesis_found 2025-05-07T20:32:21.9774737Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:21.9775303Z +-+---------------- 1 ---------------- 2025-05-07T20:32:21.9775713Z | Traceback (most recent call last): 2025-05-07T20:32:21.9776684Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:21.9777745Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9778246Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9781016Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9783820Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9784483Z | self=, 2025-05-07T20:32:21.9785033Z | T=2048, 2025-05-07T20:32:21.9785357Z | D=5120, # or any other generated value 2025-05-07T20:32:21.9785844Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:21.9786350Z | contiguous=True, # or any other generated value 2025-05-07T20:32:21.9786858Z | compiled=False, # or any other generated value 2025-05-07T20:32:21.9787310Z | ) 2025-05-07T20:32:21.9787578Z | 2025-05-07T20:32:21.9788290Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:21.9789123Z +---------------- 2 ---------------- 2025-05-07T20:32:21.9789525Z | Traceback (most recent call last): 2025-05-07T20:32:21.9790528Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:21.9791607Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9792103Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9794898Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9797618Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9798249Z | self=, 2025-05-07T20:32:21.9798798Z | T=128, 2025-05-07T20:32:21.9799072Z | D=7168, 2025-05-07T20:32:21.9799359Z | scale_ub=None, 2025-05-07T20:32:21.9799696Z | contiguous=True, 2025-05-07T20:32:21.9800016Z | compiled=True, 2025-05-07T20:32:21.9800323Z | ) 2025-05-07T20:32:21.9800575Z | 2025-05-07T20:32:21.9801377Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:21.9802286Z +---------------- 3 ---------------- 2025-05-07T20:32:21.9802678Z | Traceback (most recent call last): 2025-05-07T20:32:21.9803783Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:21.9804865Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9805374Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9807545Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
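[editor note] Each sub-failure ends with a reproduction hint, and applying it is mechanical: stack the decorator on top of the existing @given test and Hypothesis replays exactly that example. The payload is version-specific (here 6.131.14) and only decodes against the original strategies, so it belongs on the real test unchanged. A minimal sketch using the blob from failure 1, with the strategies copied from the log:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # original body from activation_test.py; remove the decorator once fixed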
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9809561Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9809994Z | self=, 2025-05-07T20:32:21.9810396Z | T=128, 2025-05-07T20:32:21.9810606Z | D=5120, 2025-05-07T20:32:21.9810898Z | scale_ub=1200.0, 2025-05-07T20:32:21.9811237Z | contiguous=True, 2025-05-07T20:32:21.9811579Z | compiled=True, 2025-05-07T20:32:21.9811889Z | ) 2025-05-07T20:32:21.9812138Z | 2025-05-07T20:32:21.9812881Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:21.9813745Z +---------------- 4 ---------------- 2025-05-07T20:32:21.9814164Z | Traceback (most recent call last): 2025-05-07T20:32:21.9815203Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:21.9816247Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.9816660Z | ^^^^^^^^ 2025-05-07T20:32:21.9817570Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:21.9818277Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9818609Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9819426Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:21.9820227Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.9820988Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:21.9822040Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9822649Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9823496Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:21.9824556Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9825217Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9826161Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:21.9827370Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9828086Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9828980Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:21.9829940Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.9830451Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9831286Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:21.9832082Z | fn() 2025-05-07T20:32:21.9832891Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:21.9833752Z | self.fn.run( 2025-05-07T20:32:21.9834483Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:21.9835379Z | kernel = self.compile( 2025-05-07T20:32:21.9835752Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:21.9836604Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:21.9837586Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.9838115Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9839311Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:21.9840453Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.9841131Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9841656Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.9842149Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.9842508Z | ^ 2025-05-07T20:32:21.9843149Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.9844154Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9844691Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:21.9845409Z | self=, 2025-05-07T20:32:21.9846009Z | T=1, # or any other generated value 2025-05-07T20:32:21.9846432Z | D=5120, # or any other generated value 2025-05-07T20:32:21.9846884Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:21.9847387Z | contiguous=True, # or any other generated value 2025-05-07T20:32:21.9847913Z | compiled=True, # or any other generated value 2025-05-07T20:32:21.9848513Z | ) 2025-05-07T20:32:21.9848766Z | 2025-05-07T20:32:21.9849515Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:21.9850374Z +------------------------------------ 2025-05-07T20:32:21.9850855Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:21.9851376Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.9851948Z self=, 2025-05-07T20:32:21.9852480Z T=1, 2025-05-07T20:32:21.9852734Z D=5120, 2025-05-07T20:32:21.9853001Z scale_ub=None, 2025-05-07T20:32:21.9853292Z contiguous=True, 2025-05-07T20:32:21.9853604Z compiled=True, 2025-05-07T20:32:21.9853895Z ) 2025-05-07T20:32:21.9854333Z self = 2025-05-07T20:32:21.9855099Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:21.9855545Z 2025-05-07T20:32:21.9855654Z @given( 2025-05-07T20:32:21.9855976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9856397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9856823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9857292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9857744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9858142Z ) 2025-05-07T20:32:21.9858627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9859234Z def test_silu_mul_quant( 2025-05-07T20:32:21.9859567Z self, 2025-05-07T20:32:21.9859837Z T: int, 2025-05-07T20:32:21.9860098Z D: int, 2025-05-07T20:32:21.9860396Z scale_ub: Optional[float], 2025-05-07T20:32:21.9860754Z contiguous: 
bool, 2025-05-07T20:32:21.9861076Z compiled: bool, 2025-05-07T20:32:21.9861393Z ) -> None: 2025-05-07T20:32:21.9861800Z torch.manual_seed(2025) 2025-05-07T20:32:21.9862136Z 2025-05-07T20:32:21.9862520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9863005Z 2025-05-07T20:32:21.9863280Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9863678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9864118Z x = x_sign * x_clamp 2025-05-07T20:32:21.9864452Z x0 = x[:, :D] 2025-05-07T20:32:21.9864720Z x1 = x[:, D:] 2025-05-07T20:32:21.9865018Z 2025-05-07T20:32:21.9865278Z if contiguous: 2025-05-07T20:32:21.9865604Z x0 = x0.contiguous() 2025-05-07T20:32:21.9865964Z x1 = x1.contiguous() 2025-05-07T20:32:21.9866304Z 2025-05-07T20:32:21.9866568Z if scale_ub is not None: 2025-05-07T20:32:21.9866965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9867439Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9867875Z ) 2025-05-07T20:32:21.9868147Z else: 2025-05-07T20:32:21.9868443Z scale_ub_tensor = None 2025-05-07T20:32:21.9868803Z 2025-05-07T20:32:21.9869127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9869557Z op = silu_mul_quant 2025-05-07T20:32:21.9869912Z if compiled: 2025-05-07T20:32:21.9870255Z op = torch.compile(op) 2025-05-07T20:32:21.9870662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9871049Z 2025-05-07T20:32:21.9871310Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.9871709Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.9872120Z 2025-05-07T20:32:21.9872444Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9872910Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.9873327Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.9873817Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.9874322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9874748Z 2025-05-07T20:32:21.9875028Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.9875294Z 2025-05-07T20:32:21.9875430Z moe/activation_test.py:126: 2025-05-07T20:32:21.9875842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9876313Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.9876765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9877879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.9878957Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.9879735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.9880702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9881686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.9882690Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9883892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.9884951Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9885983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.9886892Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.9887740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.9888460Z fn() 2025-05-07T20:32:21.9889212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.9890007Z self.fn.run( 2025-05-07T20:32:21.9890685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.9891386Z kernel = self.compile( 2025-05-07T20:32:21.9892106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.9892966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.9893522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9893859Z 2025-05-07T20:32:21.9894145Z self = 2025-05-07T20:32:21.9895666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.9897623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cda813a0>} 2025-05-07T20:32:21.9899409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.9900830Z context = 2025-05-07T20:32:21.9901209Z 2025-05-07T20:32:21.9901432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.9902128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.9902745Z module_map=module_map) 2025-05-07T20:32:21.9903312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.9903790Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.9904128Z E ^ 2025-05-07T20:32:21.9904739Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.9905335Z 2025-05-07T20:32:21.9905893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.9906582Z 2025-05-07T20:32:21.9906724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.9907288Z self=, 2025-05-07T20:32:21.9907837Z T=2048, 2025-05-07T20:32:21.9908086Z D=5120, 2025-05-07T20:32:21.9908333Z scale_ub=1200.0, 2025-05-07T20:32:21.9908628Z contiguous=True, 2025-05-07T20:32:21.9908939Z compiled=False, 2025-05-07T20:32:21.9909262Z ) 2025-05-07T20:32:21.9909714Z self = 2025-05-07T20:32:21.9910473Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.9910861Z 2025-05-07T20:32:21.9910974Z @given( 2025-05-07T20:32:21.9911271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9911697Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9912101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9912535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9912974Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9913363Z ) 2025-05-07T20:32:21.9913818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9914413Z def test_silu_mul_quant( 2025-05-07T20:32:21.9914735Z self, 2025-05-07T20:32:21.9914991Z T: int, 2025-05-07T20:32:21.9915256Z D: int, 2025-05-07T20:32:21.9915545Z scale_ub: Optional[float], 2025-05-07T20:32:21.9915983Z contiguous: bool, 2025-05-07T20:32:21.9937593Z compiled: bool, 2025-05-07T20:32:21.9937917Z ) -> None: 2025-05-07T20:32:21.9938220Z torch.manual_seed(2025) 2025-05-07T20:32:21.9938823Z 2025-05-07T20:32:21.9939203Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9939660Z 2025-05-07T20:32:21.9939936Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9940338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9940769Z x = x_sign * x_clamp 2025-05-07T20:32:21.9941090Z x0 = x[:, :D] 2025-05-07T20:32:21.9941378Z x1 = x[:, D:] 2025-05-07T20:32:21.9941660Z 2025-05-07T20:32:21.9941911Z if contiguous: 2025-05-07T20:32:21.9942223Z x0 = x0.contiguous() 2025-05-07T20:32:21.9942563Z x1 = x1.contiguous() 2025-05-07T20:32:21.9942895Z 2025-05-07T20:32:21.9943168Z if scale_ub is not None: 2025-05-07T20:32:21.9943536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9944000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9944411Z ) 2025-05-07T20:32:21.9944663Z else: 2025-05-07T20:32:21.9944963Z scale_ub_tensor = None 2025-05-07T20:32:21.9945316Z 2025-05-07T20:32:21.9945634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9946083Z op = silu_mul_quant 2025-05-07T20:32:21.9946436Z if compiled: 2025-05-07T20:32:21.9946764Z op = torch.compile(op) 2025-05-07T20:32:21.9947164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9947545Z 2025-05-07T20:32:21.9947791Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.9948019Z 2025-05-07T20:32:21.9948157Z moe/activation_test.py:117: 2025-05-07T20:32:21.9948577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9949050Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.9949631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9950599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.9951555Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.9952292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.9953258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9954200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.9954970Z kernel = self.compile( 2025-05-07T20:32:21.9955733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.9956765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.9957345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9957775Z 2025-05-07T20:32:21.9958068Z self = 2025-05-07T20:32:21.9959560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.9961518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cd7382c0>} 2025-05-07T20:32:21.9963456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.9964919Z context = 2025-05-07T20:32:21.9965309Z 2025-05-07T20:32:21.9965629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.9966332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.9966968Z module_map=module_map) 2025-05-07T20:32:21.9967493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.9967983Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.9968347Z E ^ 2025-05-07T20:32:21.9968993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.9969616Z 2025-05-07T20:32:21.9970191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.9970938Z 2025-05-07T20:32:21.9971078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.9971647Z self=, 2025-05-07T20:32:21.9972188Z T=2048, 2025-05-07T20:32:21.9972435Z D=5120, 2025-05-07T20:32:21.9972693Z scale_ub=1200.0, 2025-05-07T20:32:21.9972999Z contiguous=True, 2025-05-07T20:32:21.9973293Z compiled=True, 2025-05-07T20:32:21.9973572Z ) 2025-05-07T20:32:21.9974024Z self = 2025-05-07T20:32:21.9974704Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.9975091Z 2025-05-07T20:32:21.9975199Z @given( 2025-05-07T20:32:21.9975522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9975975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9976375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9976822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9977176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9977460Z ) 2025-05-07T20:32:21.9977878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9978335Z def test_silu_mul_quant( 2025-05-07T20:32:21.9978575Z self, 2025-05-07T20:32:21.9978778Z T: int, 2025-05-07T20:32:21.9978987Z D: int, 2025-05-07T20:32:21.9979201Z scale_ub: Optional[float], 2025-05-07T20:32:21.9979478Z contiguous: bool, 2025-05-07T20:32:21.9979721Z compiled: bool, 2025-05-07T20:32:21.9979941Z ) -> None: 2025-05-07T20:32:21.9980162Z torch.manual_seed(2025) 2025-05-07T20:32:21.9980406Z 2025-05-07T20:32:21.9980671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9981017Z 2025-05-07T20:32:21.9981212Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9981499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9981809Z x = x_sign * x_clamp 2025-05-07T20:32:21.9982099Z x0 = x[:, :D] 2025-05-07T20:32:21.9982314Z x1 = x[:, D:] 2025-05-07T20:32:21.9982576Z 2025-05-07T20:32:21.9982764Z if contiguous: 2025-05-07T20:32:21.9982991Z x0 = x0.contiguous() 2025-05-07T20:32:21.9983251Z x1 = x1.contiguous() 2025-05-07T20:32:21.9983494Z 2025-05-07T20:32:21.9983679Z if scale_ub is not None: 2025-05-07T20:32:21.9983954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9984291Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9984605Z ) 2025-05-07T20:32:21.9984796Z else: 2025-05-07T20:32:21.9985006Z scale_ub_tensor = None 2025-05-07T20:32:21.9985261Z 2025-05-07T20:32:21.9985486Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9985798Z op = silu_mul_quant 2025-05-07T20:32:21.9986047Z if compiled: 2025-05-07T20:32:21.9986288Z op = torch.compile(op) 2025-05-07T20:32:21.9986590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9986870Z 2025-05-07T20:32:21.9987108Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.9987394Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.9987682Z 2025-05-07T20:32:21.9987918Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9988256Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.9988548Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.9988866Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.9989223Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9989537Z 2025-05-07T20:32:21.9989737Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.9989931Z 2025-05-07T20:32:21.9990033Z moe/activation_test.py:126: 2025-05-07T20:32:21.9990330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9990666Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.9990994Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9991797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.9992552Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.9993097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.9993775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9994469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.9995195Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9995952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.9996746Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9997487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.9998128Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.9998731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.9999246Z fn() 2025-05-07T20:32:21.9999755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0000349Z self.fn.run( 2025-05-07T20:32:22.0000852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0001387Z kernel = self.compile( 2025-05-07T20:32:22.0001975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0002666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0003067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0003405Z 2025-05-07T20:32:22.0003612Z self = 2025-05-07T20:32:22.0004692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0006059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cd739440>} 2025-05-07T20:32:22.0007399Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0008472Z context = 2025-05-07T20:32:22.0008766Z 2025-05-07T20:32:22.0008933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0009456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0009915Z module_map=module_map) 2025-05-07T20:32:22.0010282Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0010640Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.0010900Z E ^ 2025-05-07T20:32:22.0011364Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0011823Z 2025-05-07T20:32:22.0012244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0012758Z 2025-05-07T20:32:22.0012872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0013282Z self=, 2025-05-07T20:32:22.0013686Z T=16384, 2025-05-07T20:32:22.0013878Z D=7168, 2025-05-07T20:32:22.0014063Z scale_ub=1200.0, 2025-05-07T20:32:22.0014288Z contiguous=False, 2025-05-07T20:32:22.0014513Z compiled=False, 2025-05-07T20:32:22.0014712Z ) 2025-05-07T20:32:22.0015032Z self = 2025-05-07T20:32:22.0015536Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0015815Z 2025-05-07T20:32:22.0015901Z @given( 2025-05-07T20:32:22.0016124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0016438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0016744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0017071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0017448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0017744Z ) 2025-05-07T20:32:22.0018086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0018530Z def test_silu_mul_quant( 2025-05-07T20:32:22.0018770Z self, 2025-05-07T20:32:22.0018961Z T: int, 2025-05-07T20:32:22.0019153Z D: int, 2025-05-07T20:32:22.0019370Z scale_ub: Optional[float], 2025-05-07T20:32:22.0019637Z contiguous: bool, 2025-05-07T20:32:22.0019871Z compiled: bool, 2025-05-07T20:32:22.0020092Z ) -> None: 2025-05-07T20:32:22.0020302Z torch.manual_seed(2025) 2025-05-07T20:32:22.0020540Z 2025-05-07T20:32:22.0020812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0021155Z 2025-05-07T20:32:22.0021342Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0021682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0022056Z x = x_sign * x_clamp 2025-05-07T20:32:22.0022304Z x0 = x[:, :D] 2025-05-07T20:32:22.0022522Z x1 = x[:, D:] 2025-05-07T20:32:22.0022730Z 2025-05-07T20:32:22.0022910Z if contiguous: 2025-05-07T20:32:22.0023148Z x0 = x0.contiguous() 2025-05-07T20:32:22.0023403Z x1 = x1.contiguous() 2025-05-07T20:32:22.0023638Z 2025-05-07T20:32:22.0023834Z if scale_ub is not None: 2025-05-07T20:32:22.0024108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0024439Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0024748Z ) 2025-05-07T20:32:22.0024939Z else: 2025-05-07T20:32:22.0025146Z scale_ub_tensor = None 2025-05-07T20:32:22.0025401Z 2025-05-07T20:32:22.0025633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0025939Z op = silu_mul_quant 2025-05-07T20:32:22.0026192Z if compiled: 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f18cc82e660>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
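In the examples where fn() succeeds, the test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None] and compares against ref_fn(), so triton_quantize_fp8_row is expected to return a row-quantized fp8 tensor plus a per-row multiplier that restores the original scale. A minimal pure-PyTorch sketch of that contract, assuming scale_ub caps the per-row max; quantize_fp8_row_ref is an illustrative name, not the FBGEMM implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max picks the scale so each row spans the fp8 range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        # Guard all-zero rows against division by zero.
        scale = torch.where(row_max > 0, row_max / fp8_max, torch.ones_like(row_max))
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale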
Hypothesis keeps drawing examples, and every one fails with the identical fp8e4nv CompilationError; only the sampled parameters and the first kernel reached differ:

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fn() succeeds; ref_fn() raises CompilationError at moe/activation_test.py:126, from _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
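The "Trying example:" lines are produced by Hypothesis: @settings(verbosity=Verbosity.verbose) prints every drawn example, and st.sampled_from redraws from the fixed candidate lists until max_examples is exhausted, which is why the same traceback repeats under different (T, D, scale_ub, contiguous, compiled) tuples. A standalone sketch of that pattern, assuming only that hypothesis is installed:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_demo(T: int, compiled: bool) -> None:
        # Verbose verbosity prints each drawn pair as "Trying example: ...".
        assert T > 0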
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fn() succeeds; ref_fn() raises CompilationError at moe/activation_test.py:126, from _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
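In the ref_fn() failures the error surfaces under triton/runtime/autotuner.py because _kernel_quantize_fp8_row is an autotuned kernel: Triton benchmarks each candidate config through do_bench, and the very first benchmark triggers the compile that raises. A minimal sketch of that autotune structure, with an illustrative kernel (not the FBGEMM one):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK": 128}), triton.Config({"BLOCK": 256})],
        key=["n"],
    )
    @triton.jit
    def _double_kernel(x_ptr, n, BLOCK: tl.constexpr):
        # Each config is compiled and timed inside the autotuner's _bench call.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(x_ptr + offs, x * 2.0, mask=mask)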
y_scale_ref = ref_fn() 2025-05-07T20:32:22.0197151Z 2025-05-07T20:32:22.0197250Z moe/activation_test.py:126: 2025-05-07T20:32:22.0197375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0197487Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.0197624Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0198202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.0198301Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.0198663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0198887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0199256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.0199515Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0199916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.0200219Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0200614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.0200778Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.0201122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.0201203Z fn() 2025-05-07T20:32:22.0201607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0201693Z self.fn.run( 2025-05-07T20:32:22.0202033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0202124Z kernel = self.compile( 2025-05-07T20:32:22.0202552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0202770Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0202897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0202907Z 2025-05-07T20:32:22.0203113Z self = 2025-05-07T20:32:22.0203962Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0204466Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18a3a8afc0>} 2025-05-07T20:32:22.0205224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0205467Z context = 2025-05-07T20:32:22.0205472Z 2025-05-07T20:32:22.0205636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0205900Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0206008Z module_map=module_map) 2025-05-07T20:32:22.0206170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0206274Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.0206350Z E ^ 2025-05-07T20:32:22.0206706Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0206710Z 2025-05-07T20:32:22.0207135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0207142Z 2025-05-07T20:32:22.0207251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0207479Z self=, 2025-05-07T20:32:22.0207554Z T=2048, 2025-05-07T20:32:22.0207628Z D=5120, 2025-05-07T20:32:22.0207716Z scale_ub=None, 2025-05-07T20:32:22.0207801Z contiguous=True, 2025-05-07T20:32:22.0207884Z compiled=True, 2025-05-07T20:32:22.0207958Z ) 2025-05-07T20:32:22.0208177Z self = 2025-05-07T20:32:22.0208345Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0208350Z 2025-05-07T20:32:22.0208429Z @given( 2025-05-07T20:32:22.0208547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0208644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0208764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0208883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0209049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0209126Z ) 2025-05-07T20:32:22.0209368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0209466Z def test_silu_mul_quant( 2025-05-07T20:32:22.0209543Z self, 2025-05-07T20:32:22.0209619Z T: int, 2025-05-07T20:32:22.0209701Z D: int, 2025-05-07T20:32:22.0209797Z scale_ub: Optional[float], 2025-05-07T20:32:22.0209885Z contiguous: bool, 2025-05-07T20:32:22.0209973Z compiled: bool, 2025-05-07T20:32:22.0210051Z ) -> None: 2025-05-07T20:32:22.0210143Z torch.manual_seed(2025) 2025-05-07T20:32:22.0210220Z 2025-05-07T20:32:22.0210388Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0210465Z 2025-05-07T20:32:22.0210555Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0210719Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0210851Z x = x_sign * x_clamp 2025-05-07T20:32:22.0210933Z x0 = x[:, :D] 2025-05-07T20:32:22.0211010Z x1 = x[:, D:] 2025-05-07T20:32:22.0211085Z 2025-05-07T20:32:22.0211168Z if contiguous: 2025-05-07T20:32:22.0211259Z x0 = x0.contiguous() 2025-05-07T20:32:22.0211348Z x1 = x1.contiguous() 2025-05-07T20:32:22.0211418Z 2025-05-07T20:32:22.0211503Z if scale_ub is not None: 2025-05-07T20:32:22.0211609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0211742Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0211821Z ) 2025-05-07T20:32:22.0211897Z else: 2025-05-07T20:32:22.0211997Z scale_ub_tensor = None 2025-05-07T20:32:22.0212073Z 2025-05-07T20:32:22.0212202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0212289Z op = silu_mul_quant 2025-05-07T20:32:22.0212379Z if compiled: 
2025-05-07T20:32:22.0212484Z op = torch.compile(op) 2025-05-07T20:32:22.0212632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0212707Z 2025-05-07T20:32:22.0212796Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.0212916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.0212990Z 2025-05-07T20:32:22.0213124Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0213230Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.0213325Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.0213443Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.0213586Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0213659Z 2025-05-07T20:32:22.0213756Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.0213761Z 2025-05-07T20:32:22.0213866Z moe/activation_test.py:126: 2025-05-07T20:32:22.0213996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0214105Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.0214243Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0214803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.0214910Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.0215272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0215494Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0215866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.0216120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0216570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.0216831Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0217206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.0217375Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.0217718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.0217801Z fn() 2025-05-07T20:32:22.0218206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0218288Z self.fn.run( 2025-05-07T20:32:22.0218630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0218762Z kernel = self.compile( 2025-05-07T20:32:22.0219213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0219392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0219518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0219523Z 2025-05-07T20:32:22.0219729Z self = 2025-05-07T20:32:22.0220517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:22.0221051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a37bf420>} 2025-05-07T20:32:22.0221810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0222043Z context = 2025-05-07T20:32:22.0222048Z 2025-05-07T20:32:22.0222212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0222480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0222583Z module_map=module_map) 2025-05-07T20:32:22.0222749Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0222849Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.0222929Z E ^ 2025-05-07T20:32:22.0223283Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0223289Z 2025-05-07T20:32:22.0223707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0223718Z 2025-05-07T20:32:22.0223825Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0224049Z self=, 2025-05-07T20:32:22.0224133Z T=128, 2025-05-07T20:32:22.0224209Z D=5120, 2025-05-07T20:32:22.0224292Z scale_ub=None, 2025-05-07T20:32:22.0224380Z contiguous=True, 2025-05-07T20:32:22.0224462Z compiled=True, 2025-05-07T20:32:22.0224534Z ) 2025-05-07T20:32:22.0224758Z self = 2025-05-07T20:32:22.0224924Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0224929Z 2025-05-07T20:32:22.0225006Z @given( 2025-05-07T20:32:22.0225129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0225226Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0225383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0225508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0225619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0225695Z ) 2025-05-07T20:32:22.0225937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0226029Z def test_silu_mul_quant( 2025-05-07T20:32:22.0226105Z self, 2025-05-07T20:32:22.0226180Z T: int, 2025-05-07T20:32:22.0226259Z D: int, 2025-05-07T20:32:22.0226358Z scale_ub: Optional[float], 2025-05-07T20:32:22.0226446Z contiguous: bool, 2025-05-07T20:32:22.0226528Z compiled: bool, 2025-05-07T20:32:22.0226606Z ) -> None: 2025-05-07T20:32:22.0226699Z torch.manual_seed(2025) 2025-05-07T20:32:22.0226771Z 2025-05-07T20:32:22.0226943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0227063Z 2025-05-07T20:32:22.0227167Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0227330Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0227416Z x = x_sign * x_clamp 2025-05-07T20:32:22.0227502Z x0 = x[:, :D] 2025-05-07T20:32:22.0227578Z x1 = x[:, D:] 2025-05-07T20:32:22.0227652Z 2025-05-07T20:32:22.0227737Z if contiguous: 2025-05-07T20:32:22.0227826Z x0 = x0.contiguous() 2025-05-07T20:32:22.0227914Z x1 = x1.contiguous() 2025-05-07T20:32:22.0227991Z 2025-05-07T20:32:22.0228077Z if scale_ub is not None: 2025-05-07T20:32:22.0228181Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0228319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0228395Z ) 2025-05-07T20:32:22.0228475Z else: 2025-05-07T20:32:22.0228566Z scale_ub_tensor = None 2025-05-07T20:32:22.0228637Z 2025-05-07T20:32:22.0228772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:22.0228863Z op = silu_mul_quant 2025-05-07T20:32:22.0228989Z if compiled: 2025-05-07T20:32:22.0229094Z op = torch.compile(op) 2025-05-07T20:32:22.0229196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0229266Z 2025-05-07T20:32:22.0229358Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.0229478Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.0229549Z 2025-05-07T20:32:22.0229687Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0229784Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.0229887Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.0230007Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.0230147Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0245871Z 2025-05-07T20:32:22.0246020Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.0246027Z 2025-05-07T20:32:22.0246138Z moe/activation_test.py:126: 2025-05-07T20:32:22.0246285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0246391Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.0246532Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0247108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.0247214Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.0247575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0247806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0248180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.0248558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0248968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.0249224Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0249610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.0249779Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.0250128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.0250205Z fn() 2025-05-07T20:32:22.0250609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0250697Z self.fn.run( 2025-05-07T20:32:22.0251099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0251248Z kernel = self.compile( 2025-05-07T20:32:22.0251641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0251816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0251952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0251957Z 2025-05-07T20:32:22.0252164Z self = 2025-05-07T20:32:22.0252940Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
2025-05-07T20:32:22.0256257Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.0262930Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.0263042Z moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]
2025-05-07T20:32:22.0271856Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0271959Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.0272043Z E       ^
2025-05-07T20:32:22.0272403Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0272934Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.0279473Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.0279575Z moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]
2025-05-07T20:32:22.0288423Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0288530Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.0288604Z E       ^
2025-05-07T20:32:22.0288958Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0289491Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:22.0294974Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0295078Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0301461Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0301557Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0301633Z E       ^
2025-05-07T20:32:22.0301996Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
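Both failing paths, the fused silu_mul_quant op and the ref_fn reference, return a (y_fp8, y_scale) pair that the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A sketch of the row-wise contract that implies, assuming scale = row_max / FP8_MAX with row_max optionally capped by scale_ub; this is an inference from the test body, not FBGEMM's triton_quantize_fp8_row implementation, and the zero-guard epsilon is illustrative:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub=None):
    # Per-row dequantization scale: largest magnitude in the row over FP8_MAX.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:  # scale_ub arrives as a 1-element fp32 tensor
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / FP8_MAX).clamp(min=1e-12)  # illustrative zero guard
    q = (y.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return q.to(torch.float8_e4m3fn), scale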
2025-05-07T20:32:22.0302530Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.0308976Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.0309077Z moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]
2025-05-07T20:32:22.0317884Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0317980Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.0318062Z E       ^
2025-05-07T20:32:22.0318416Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0318987Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:22.0324531Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0324632Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0330445Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0330583Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0330665Z E       ^
2025-05-07T20:32:22.0331023Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0331542Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.0336983Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0337085Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0343877Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0343974Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0344053Z E       ^
2025-05-07T20:32:22.0344409Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0344950Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.0350523Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0350636Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0356449Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0356548Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0356665Z E       ^
2025-05-07T20:32:22.0357021Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
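Note that every CompilationError is reported at 1:0, the kernel's def line: make_ir aborts while lowering the AST, before any launch, so the kernel bodies and autotuner configs never come into play. A hypothetical standalone repro sketch under the same assumption (a CUDA device below SM 8.9); tl.float8e4nv is Triton's spelling of the dtype named in the message, and the kernel here is illustrative, not an FBGEMM kernel:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8(x_ptr, y_ptr):
    # The cast below is what trips Triton's fp8e4nv architecture check.
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))

x = torch.ones(1, device="cuda")
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8[(1,)](x, y)  # expected: triton.compiler.errors.CompilationError on SM < 8.9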
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0357073Z 2025-05-07T20:32:22.0357485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0357489Z 2025-05-07T20:32:22.0357591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0357812Z self=, 2025-05-07T20:32:22.0357888Z T=128, 2025-05-07T20:32:22.0357962Z D=5120, 2025-05-07T20:32:22.0358052Z scale_ub=None, 2025-05-07T20:32:22.0358137Z contiguous=False, 2025-05-07T20:32:22.0358222Z compiled=False, 2025-05-07T20:32:22.0358302Z ) 2025-05-07T20:32:22.0358516Z self = 2025-05-07T20:32:22.0358689Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0358696Z 2025-05-07T20:32:22.0358771Z @given( 2025-05-07T20:32:22.0358894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0359042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0359154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0359267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0359386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0359457Z ) 2025-05-07T20:32:22.0359697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0359789Z def test_silu_mul_quant( 2025-05-07T20:32:22.0359863Z self, 2025-05-07T20:32:22.0359944Z T: int, 2025-05-07T20:32:22.0360021Z D: int, 2025-05-07T20:32:22.0360116Z scale_ub: Optional[float], 2025-05-07T20:32:22.0360207Z contiguous: bool, 2025-05-07T20:32:22.0360288Z compiled: bool, 2025-05-07T20:32:22.0360362Z ) -> None: 2025-05-07T20:32:22.0360461Z torch.manual_seed(2025) 2025-05-07T20:32:22.0360538Z 2025-05-07T20:32:22.0360712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0360789Z 2025-05-07T20:32:22.0360880Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0361007Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0361097Z x = x_sign * x_clamp 2025-05-07T20:32:22.0361171Z x0 = x[:, :D] 2025-05-07T20:32:22.0361253Z x1 = x[:, D:] 2025-05-07T20:32:22.0361326Z 2025-05-07T20:32:22.0361407Z if contiguous: 2025-05-07T20:32:22.0361506Z x0 = x0.contiguous() 2025-05-07T20:32:22.0361595Z x1 = x1.contiguous() 2025-05-07T20:32:22.0361665Z 2025-05-07T20:32:22.0361758Z if scale_ub is not None: 2025-05-07T20:32:22.0361861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0361991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0362073Z ) 2025-05-07T20:32:22.0362147Z else: 2025-05-07T20:32:22.0362283Z scale_ub_tensor = None 2025-05-07T20:32:22.0362365Z 2025-05-07T20:32:22.0362490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0362579Z op = silu_mul_quant 2025-05-07T20:32:22.0362664Z if compiled: 2025-05-07T20:32:22.0362762Z op = torch.compile(op) 2025-05-07T20:32:22.0362867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0362936Z 2025-05-07T20:32:22.0363027Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0363031Z 2025-05-07T20:32:22.0363132Z moe/activation_test.py:117: 2025-05-07T20:32:22.0363260Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0363483Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0363584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0364151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0364291Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0364652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0364874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0365221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0365313Z kernel = self.compile( 2025-05-07T20:32:22.0365698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0365880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0366005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0366009Z 2025-05-07T20:32:22.0366216Z self = 2025-05-07T20:32:22.0366990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0367528Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a212c720>} 2025-05-07T20:32:22.0368283Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0368471Z context = 2025-05-07T20:32:22.0368476Z 2025-05-07T20:32:22.0368644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0368904Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0369018Z module_map=module_map) 2025-05-07T20:32:22.0369178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0369272Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0369353Z E ^ 2025-05-07T20:32:22.0369704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0369708Z 2025-05-07T20:32:22.0370120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0370125Z 2025-05-07T20:32:22.0370232Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0370450Z self=, 2025-05-07T20:32:22.0370527Z T=128, 2025-05-07T20:32:22.0370600Z D=5120, 2025-05-07T20:32:22.0370683Z scale_ub=1200.0, 2025-05-07T20:32:22.0375286Z contiguous=True, 2025-05-07T20:32:22.0375465Z compiled=False, 2025-05-07T20:32:22.0375555Z ) 2025-05-07T20:32:22.0375781Z self = 2025-05-07T20:32:22.0375957Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0375962Z 2025-05-07T20:32:22.0376041Z @given( 2025-05-07T20:32:22.0376163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0376271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0376381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0376498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0376616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0376690Z ) 2025-05-07T20:32:22.0376935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0377039Z def test_silu_mul_quant( 2025-05-07T20:32:22.0377159Z self, 2025-05-07T20:32:22.0377246Z T: int, 2025-05-07T20:32:22.0377366Z D: int, 2025-05-07T20:32:22.0377461Z scale_ub: Optional[float], 2025-05-07T20:32:22.0377553Z contiguous: bool, 2025-05-07T20:32:22.0377638Z compiled: bool, 2025-05-07T20:32:22.0377715Z ) -> None: 2025-05-07T20:32:22.0377814Z torch.manual_seed(2025) 2025-05-07T20:32:22.0377887Z 2025-05-07T20:32:22.0378057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0378132Z 2025-05-07T20:32:22.0378224Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0378351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0378444Z x = x_sign * x_clamp 2025-05-07T20:32:22.0378521Z x0 = x[:, :D] 2025-05-07T20:32:22.0378603Z x1 = x[:, D:] 2025-05-07T20:32:22.0378672Z 2025-05-07T20:32:22.0378755Z if contiguous: 2025-05-07T20:32:22.0378852Z x0 = x0.contiguous() 2025-05-07T20:32:22.0378941Z x1 = x1.contiguous() 2025-05-07T20:32:22.0379018Z 2025-05-07T20:32:22.0379162Z if scale_ub is not None: 2025-05-07T20:32:22.0379265Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0379398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0379479Z ) 2025-05-07T20:32:22.0379554Z else: 2025-05-07T20:32:22.0379645Z scale_ub_tensor = None 2025-05-07T20:32:22.0379721Z 2025-05-07T20:32:22.0379849Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0379936Z op = silu_mul_quant 2025-05-07T20:32:22.0380023Z if compiled: 2025-05-07T20:32:22.0380123Z op = torch.compile(op) 2025-05-07T20:32:22.0380231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0380305Z 2025-05-07T20:32:22.0380396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0380402Z 2025-05-07T20:32:22.0380523Z moe/activation_test.py:117: 2025-05-07T20:32:22.0380680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0380787Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0380888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0381391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0381493Z 
[traceback continued from the previous Hypothesis example]
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
[... make_ir frame and locals identical to the failure above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
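Note on this failure: "fp8e4nv" is Triton's name for the NVIDIA float8 e4m3
format (torch.float8_e4m3fn). Triton's CUDA backend only compiles that dtype
on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job runs on a
g5.4xlarge, whose A10G GPU is sm_86, where only fp8e4b15 and fp8e5 are
available, so any kernel that materializes the dtype dies in ast_to_ttir
before launch. Below is a minimal sketch of a capability gate such a test
could use; the helper name, the class name, and the skip placement are
illustrative assumptions, not code from activation_test.py:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (float8 e4m3) only for compute capability
        # >= (8, 9); the capability can be queried through PyTorch.
        # (Hypothetical helper, shown for illustration.)
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant as shown above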
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
Same test source; fails at the same line (moe/activation_test.py:117) with the
identical CompilationError: ValueError("type fp8e4nv not supported in this
architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

Same test source; this example gets past fn() and fails one step later, in the
reference path, while Triton compiles FBGEMM's row-quantization kernel:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
[... make_ir frame as above; options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ...) ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
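Note: the reference path hits the same wall because triton_quantize_fp8_row
itself compiles a Triton kernel (_kernel_quantize_fp8_row). The quantity under
test is y = x0 * sigmoid(x0) * x1, quantized row-wise to fp8. The following is
a sketch of a pure-eager reference that avoids Triton entirely and so still
runs on sm_86; quantize_fp8_row_eager, FP8_MAX, and the clamping details are
illustrative assumptions (this is not FBGEMM's implementation), and it assumes
the installed torch provides torch.float8_e4m3fn:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum determines the dequantization scale.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Saturate before casting: e4m3fn has no inf to absorb overflow.
        y_fp8 = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does, y_fp8.to(torch.float32) * scale[:, None]
recovers y up to fp8 rounding error.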
2025-05-07T20:32:22.0435462Z op = torch.compile(op) 2025-05-07T20:32:22.0435570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0435635Z 2025-05-07T20:32:22.0435727Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0435732Z 2025-05-07T20:32:22.0435827Z moe/activation_test.py:117: 2025-05-07T20:32:22.0435953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0436053Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0436148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0436514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0436608Z return fn(*args, **kwargs) 2025-05-07T20:32:22.0437106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0437211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0437567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0437786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0438127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0438219Z kernel = self.compile( 2025-05-07T20:32:22.0438884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0439070Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0439194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0439199Z 2025-05-07T20:32:22.0439411Z self = 2025-05-07T20:32:22.0440272Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0440775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1fcab60>} 2025-05-07T20:32:22.0441575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0441764Z context = 2025-05-07T20:32:22.0441769Z 2025-05-07T20:32:22.0441934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0442251Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0442412Z module_map=module_map) 2025-05-07T20:32:22.0442573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0442667Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0442745Z E ^ 2025-05-07T20:32:22.0443099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0443104Z 2025-05-07T20:32:22.0443637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0443642Z 2025-05-07T20:32:22.0443743Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0443965Z self=, 2025-05-07T20:32:22.0444044Z T=1, 2025-05-07T20:32:22.0444118Z D=5120, 2025-05-07T20:32:22.0444207Z scale_ub=1200.0, 2025-05-07T20:32:22.0444300Z contiguous=False, 2025-05-07T20:32:22.0444468Z compiled=False, 2025-05-07T20:32:22.0444541Z ) 2025-05-07T20:32:22.0444760Z self = 2025-05-07T20:32:22.0444924Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0444929Z 2025-05-07T20:32:22.0445005Z @given( 2025-05-07T20:32:22.0445125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0445219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0445333Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0445444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0445553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0445628Z ) 2025-05-07T20:32:22.0445869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0445961Z def test_silu_mul_quant( 2025-05-07T20:32:22.0446036Z self, 2025-05-07T20:32:22.0446114Z T: int, 2025-05-07T20:32:22.0446192Z D: int, 2025-05-07T20:32:22.0446291Z scale_ub: Optional[float], 2025-05-07T20:32:22.0446377Z contiguous: bool, 2025-05-07T20:32:22.0446461Z compiled: bool, 2025-05-07T20:32:22.0446539Z ) -> None: 2025-05-07T20:32:22.0446631Z torch.manual_seed(2025) 2025-05-07T20:32:22.0446707Z 2025-05-07T20:32:22.0446872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0446942Z 2025-05-07T20:32:22.0447034Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0447153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0447239Z x = x_sign * x_clamp 2025-05-07T20:32:22.0447319Z x0 = x[:, :D] 2025-05-07T20:32:22.0447394Z x1 = x[:, D:] 2025-05-07T20:32:22.0447463Z 2025-05-07T20:32:22.0447548Z if contiguous: 2025-05-07T20:32:22.0447641Z x0 = x0.contiguous() 2025-05-07T20:32:22.0447770Z x1 = x1.contiguous() 2025-05-07T20:32:22.0447850Z 2025-05-07T20:32:22.0447937Z if scale_ub is not None: 2025-05-07T20:32:22.0448043Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0448174Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0448250Z ) 2025-05-07T20:32:22.0448328Z else: 2025-05-07T20:32:22.0448420Z scale_ub_tensor = None 2025-05-07T20:32:22.0448491Z 2025-05-07T20:32:22.0448618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0448706Z op = silu_mul_quant 2025-05-07T20:32:22.0448786Z if compiled: 2025-05-07T20:32:22.0448883Z op = torch.compile(op) 2025-05-07T20:32:22.0448984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0449053Z 2025-05-07T20:32:22.0449141Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0449146Z 2025-05-07T20:32:22.0449283Z moe/activation_test.py:117: 2025-05-07T20:32:22.0449417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0449556Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0449653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0450153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0450249Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0450607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0450831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0451219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0451313Z kernel = self.compile( 2025-05-07T20:32:22.0451699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0451916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0452043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0452047Z 2025-05-07T20:32:22.0452248Z self = 2025-05-07T20:32:22.0453015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0453509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1fcb2e0>} 2025-05-07T20:32:22.0454261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0454455Z context = 2025-05-07T20:32:22.0454459Z 2025-05-07T20:32:22.0454623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0454887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0454989Z module_map=module_map) 2025-05-07T20:32:22.0455147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0455247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0455323Z E ^ 2025-05-07T20:32:22.0455674Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0455684Z 2025-05-07T20:32:22.0456100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0456107Z 2025-05-07T20:32:22.0456255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0456481Z self=, 2025-05-07T20:32:22.0456556Z T=16384, 2025-05-07T20:32:22.0456632Z D=5120, 2025-05-07T20:32:22.0456721Z scale_ub=1200.0, 2025-05-07T20:32:22.0456803Z contiguous=False, 2025-05-07T20:32:22.0456883Z compiled=True, 2025-05-07T20:32:22.0456957Z ) 2025-05-07T20:32:22.0457172Z self = 2025-05-07T20:32:22.0457349Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0457353Z 2025-05-07T20:32:22.0457427Z @given( 2025-05-07T20:32:22.0457547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0457646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0457756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0457914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0458101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0458175Z ) 2025-05-07T20:32:22.0458421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0458510Z def test_silu_mul_quant( 2025-05-07T20:32:22.0458584Z self, 2025-05-07T20:32:22.0458665Z T: int, 2025-05-07T20:32:22.0458739Z D: int, 2025-05-07T20:32:22.0458834Z scale_ub: Optional[float], 2025-05-07T20:32:22.0458923Z contiguous: bool, 2025-05-07T20:32:22.0459008Z compiled: bool, 2025-05-07T20:32:22.0459083Z ) -> None: 2025-05-07T20:32:22.0459182Z torch.manual_seed(2025) 2025-05-07T20:32:22.0459252Z 2025-05-07T20:32:22.0459420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0459493Z 2025-05-07T20:32:22.0459582Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0459707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0459804Z x = x_sign * x_clamp 2025-05-07T20:32:22.0459929Z x0 = x[:, :D] 2025-05-07T20:32:22.0460013Z x1 = x[:, D:] 2025-05-07T20:32:22.0460086Z 2025-05-07T20:32:22.0460168Z if contiguous: 2025-05-07T20:32:22.0460261Z x0 = x0.contiguous() 2025-05-07T20:32:22.0460346Z x1 = x1.contiguous() 2025-05-07T20:32:22.0460416Z 2025-05-07T20:32:22.0460507Z if scale_ub is not None: 2025-05-07T20:32:22.0460608Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0460739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0460817Z ) 2025-05-07T20:32:22.0460888Z else: 2025-05-07T20:32:22.0460980Z scale_ub_tensor = None 2025-05-07T20:32:22.0461052Z 2025-05-07T20:32:22.0461177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0461267Z op = silu_mul_quant 2025-05-07T20:32:22.0461354Z if compiled: 2025-05-07T20:32:22.0461456Z op = torch.compile(op) 2025-05-07T20:32:22.0461565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0461636Z 2025-05-07T20:32:22.0461723Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0461728Z 2025-05-07T20:32:22.0461828Z moe/activation_test.py:117: 2025-05-07T20:32:22.0461954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0462052Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0462151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0462519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0462616Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0463112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0463210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0463615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0463842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0465510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0465604Z kernel = self.compile( 2025-05-07T20:32:22.0465985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0466163Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0466286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0466291Z 2025-05-07T20:32:22.0466499Z self = 2025-05-07T20:32:22.0467316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0467851Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1b40fe0>} 2025-05-07T20:32:22.0468605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0468793Z context = 2025-05-07T20:32:22.0468798Z 2025-05-07T20:32:22.0468966Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0469232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0469338Z module_map=module_map) 2025-05-07T20:32:22.0469546Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0469643Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0469717Z E ^ 2025-05-07T20:32:22.0470076Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0470080Z 2025-05-07T20:32:22.0470495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0470499Z 2025-05-07T20:32:22.0470602Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0470819Z self=, 2025-05-07T20:32:22.0470890Z T=2048, 2025-05-07T20:32:22.0470968Z D=7168, 2025-05-07T20:32:22.0471046Z scale_ub=1200.0, 2025-05-07T20:32:22.0471131Z contiguous=False, 2025-05-07T20:32:22.0471219Z compiled=True, 2025-05-07T20:32:22.0471293Z ) 2025-05-07T20:32:22.0471510Z self = 2025-05-07T20:32:22.0471690Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0471695Z 2025-05-07T20:32:22.0471768Z @given( 2025-05-07T20:32:22.0471887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0471980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0472091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0472210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0472320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0472392Z ) 2025-05-07T20:32:22.0472636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0472729Z def test_silu_mul_quant( 2025-05-07T20:32:22.0472806Z self, 2025-05-07T20:32:22.0472881Z T: int, 2025-05-07T20:32:22.0472958Z D: int, 2025-05-07T20:32:22.0473101Z scale_ub: Optional[float], 2025-05-07T20:32:22.0473195Z contiguous: bool, 2025-05-07T20:32:22.0473276Z compiled: bool, 2025-05-07T20:32:22.0473357Z ) -> None: 2025-05-07T20:32:22.0473448Z torch.manual_seed(2025) 2025-05-07T20:32:22.0473520Z 2025-05-07T20:32:22.0473688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0473757Z 2025-05-07T20:32:22.0473850Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0473975Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0474063Z x = x_sign * x_clamp 2025-05-07T20:32:22.0474140Z x0 = x[:, :D] 2025-05-07T20:32:22.0474221Z x1 = x[:, D:] 2025-05-07T20:32:22.0474291Z 2025-05-07T20:32:22.0474376Z if contiguous: 2025-05-07T20:32:22.0474465Z x0 = x0.contiguous() 2025-05-07T20:32:22.0474548Z x1 = x1.contiguous() 2025-05-07T20:32:22.0474623Z 2025-05-07T20:32:22.0476136Z if scale_ub is not None: 2025-05-07T20:32:22.0476280Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0476418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0476489Z ) 2025-05-07T20:32:22.0476563Z else: 2025-05-07T20:32:22.0476656Z scale_ub_tensor = None 2025-05-07T20:32:22.0476725Z 2025-05-07T20:32:22.0476850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0476941Z op = silu_mul_quant 2025-05-07T20:32:22.0477021Z if compiled: 2025-05-07T20:32:22.0477122Z op = torch.compile(op) 2025-05-07T20:32:22.0477223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0477292Z 2025-05-07T20:32:22.0477383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0477388Z 2025-05-07T20:32:22.0477484Z moe/activation_test.py:117: 2025-05-07T20:32:22.0477612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0477712Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0477857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0478224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0478317Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0478811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0478906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0479264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0479484Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0479826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0479917Z kernel = self.compile( 2025-05-07T20:32:22.0480310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0480486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0480612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0480616Z 2025-05-07T20:32:22.0480823Z self = 2025-05-07T20:32:22.0481642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0482139Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1b41b20>} 2025-05-07T20:32:22.0482930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0483125Z context = 2025-05-07T20:32:22.0483129Z 2025-05-07T20:32:22.0483382Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0483645Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0483752Z module_map=module_map) 2025-05-07T20:32:22.0483909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0484004Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0484081Z E ^ 2025-05-07T20:32:22.0484433Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0484438Z 2025-05-07T20:32:22.0484899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0484943Z 2025-05-07T20:32:22.0485044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0485264Z self=, 2025-05-07T20:32:22.0485345Z T=1, 2025-05-07T20:32:22.0485421Z D=5120, 2025-05-07T20:32:22.0485501Z scale_ub=None, 2025-05-07T20:32:22.0485589Z contiguous=False, 2025-05-07T20:32:22.0485672Z compiled=False, 2025-05-07T20:32:22.0485743Z ) 2025-05-07T20:32:22.0485962Z self = 2025-05-07T20:32:22.0486124Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0486128Z 2025-05-07T20:32:22.0486209Z @given( 2025-05-07T20:32:22.0486325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0486420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0486533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0486650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0486803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0486880Z ) 2025-05-07T20:32:22.0487121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0487213Z def test_silu_mul_quant( 2025-05-07T20:32:22.0487290Z self, 2025-05-07T20:32:22.0487364Z T: int, 2025-05-07T20:32:22.0487441Z D: int, 2025-05-07T20:32:22.0487536Z scale_ub: Optional[float], 2025-05-07T20:32:22.0487623Z contiguous: bool, 2025-05-07T20:32:22.0487709Z compiled: bool, 2025-05-07T20:32:22.0487786Z ) -> None: 2025-05-07T20:32:22.0487878Z torch.manual_seed(2025) 2025-05-07T20:32:22.0487952Z 2025-05-07T20:32:22.0488120Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0488190Z 2025-05-07T20:32:22.0488281Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0488407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0488496Z x = x_sign * x_clamp 2025-05-07T20:32:22.0488576Z x0 = x[:, :D] 2025-05-07T20:32:22.0488655Z x1 = x[:, D:] 2025-05-07T20:32:22.0488725Z 2025-05-07T20:32:22.0488807Z if contiguous: 2025-05-07T20:32:22.0488895Z x0 = x0.contiguous() 2025-05-07T20:32:22.0488982Z x1 = x1.contiguous() 2025-05-07T20:32:22.0489049Z 2025-05-07T20:32:22.0489135Z if scale_ub is not None: 2025-05-07T20:32:22.0489242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0489371Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0489443Z ) 2025-05-07T20:32:22.0489522Z else: 2025-05-07T20:32:22.0489613Z scale_ub_tensor = None 2025-05-07T20:32:22.0489684Z 2025-05-07T20:32:22.0489817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0489904Z op = silu_mul_quant 2025-05-07T20:32:22.0490037Z if compiled: 2025-05-07T20:32:22.0490142Z op = torch.compile(op) 2025-05-07T20:32:22.0490245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0490317Z 2025-05-07T20:32:22.0490407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0490411Z 2025-05-07T20:32:22.0490506Z moe/activation_test.py:117: 2025-05-07T20:32:22.0490638Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0490735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0490831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0491331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0491424Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0491848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0492109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0492455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0492550Z kernel = self.compile( 2025-05-07T20:32:22.0492931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0493102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0493228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0493233Z 2025-05-07T20:32:22.0493435Z self = 2025-05-07T20:32:22.0494210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0494751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1b42e80>} 2025-05-07T20:32:22.0495504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0495689Z context = 2025-05-07T20:32:22.0495693Z 2025-05-07T20:32:22.0495856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0496125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0496227Z module_map=module_map) 2025-05-07T20:32:22.0496387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0496487Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0496567Z E ^ 2025-05-07T20:32:22.0496924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0496929Z 2025-05-07T20:32:22.0497338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0497343Z 2025-05-07T20:32:22.0497443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0505862Z self=, 2025-05-07T20:32:22.0505977Z T=4096, 2025-05-07T20:32:22.0506064Z D=7168, 2025-05-07T20:32:22.0506185Z scale_ub=1200.0, 2025-05-07T20:32:22.0506281Z contiguous=False, 2025-05-07T20:32:22.0506374Z compiled=False, 2025-05-07T20:32:22.0506459Z ) 2025-05-07T20:32:22.0506727Z self = 2025-05-07T20:32:22.0507001Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0507013Z 2025-05-07T20:32:22.0507101Z @given( 2025-05-07T20:32:22.0507227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0507333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0507458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0507579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0507699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0507787Z ) 2025-05-07T20:32:22.0508044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0508143Z def test_silu_mul_quant( 2025-05-07T20:32:22.0508229Z self, 2025-05-07T20:32:22.0508311Z T: int, 2025-05-07T20:32:22.0508401Z D: int, 2025-05-07T20:32:22.0508506Z scale_ub: Optional[float], 2025-05-07T20:32:22.0508647Z contiguous: bool, 2025-05-07T20:32:22.0508747Z compiled: bool, 2025-05-07T20:32:22.0508877Z ) -> None: 2025-05-07T20:32:22.0508980Z torch.manual_seed(2025) 2025-05-07T20:32:22.0509067Z 2025-05-07T20:32:22.0509241Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0509350Z 2025-05-07T20:32:22.0509458Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0509613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0509706Z x = x_sign * x_clamp 2025-05-07T20:32:22.0509794Z x0 = x[:, :D] 2025-05-07T20:32:22.0509879Z x1 = x[:, D:] 2025-05-07T20:32:22.0509959Z 2025-05-07T20:32:22.0510056Z if contiguous: 2025-05-07T20:32:22.0510153Z x0 = x0.contiguous() 2025-05-07T20:32:22.0510252Z x1 = x1.contiguous() 2025-05-07T20:32:22.0510331Z 2025-05-07T20:32:22.0510424Z if scale_ub is not None: 2025-05-07T20:32:22.0510562Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0510723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0510858Z ) 2025-05-07T20:32:22.0510946Z else: 2025-05-07T20:32:22.0511071Z scale_ub_tensor = None 2025-05-07T20:32:22.0511147Z 2025-05-07T20:32:22.0511285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0511379Z op = silu_mul_quant 2025-05-07T20:32:22.0511467Z if compiled: 2025-05-07T20:32:22.0511577Z op = torch.compile(op) 2025-05-07T20:32:22.0511684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0511764Z 2025-05-07T20:32:22.0511857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0511861Z 2025-05-07T20:32:22.0511961Z moe/activation_test.py:117: 2025-05-07T20:32:22.0512098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0512203Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0512311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0512822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:22.0512928Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0513300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0513531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0513878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0513980Z kernel = self.compile( 2025-05-07T20:32:22.0514370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0514549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0514683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0514688Z 2025-05-07T20:32:22.0514946Z self = 2025-05-07T20:32:22.0515736Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0516236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a329c040>} 2025-05-07T20:32:22.0516993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0517186Z context = 2025-05-07T20:32:22.0517191Z 2025-05-07T20:32:22.0517401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0517719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0517831Z module_map=module_map) 2025-05-07T20:32:22.0517998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0518100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0518182Z E ^ 2025-05-07T20:32:22.0518543Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0518548Z 2025-05-07T20:32:22.0518968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0518972Z 2025-05-07T20:32:22.0519079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0519313Z self=, 2025-05-07T20:32:22.0519398Z T=16384, 2025-05-07T20:32:22.0519483Z D=7168, 2025-05-07T20:32:22.0519618Z scale_ub=None, 2025-05-07T20:32:22.0519706Z contiguous=True, 2025-05-07T20:32:22.0519794Z compiled=True, 2025-05-07T20:32:22.0519868Z ) 2025-05-07T20:32:22.0520090Z self = 2025-05-07T20:32:22.0520273Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0520278Z 2025-05-07T20:32:22.0520357Z @given( 2025-05-07T20:32:22.0520481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0520589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0520708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0520837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0520954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0521034Z ) 2025-05-07T20:32:22.0521290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0521390Z def test_silu_mul_quant( 2025-05-07T20:32:22.0521474Z self, 2025-05-07T20:32:22.0521561Z T: int, 2025-05-07T20:32:22.0521642Z D: int, 2025-05-07T20:32:22.0521743Z scale_ub: Optional[float], 2025-05-07T20:32:22.0542593Z contiguous: bool, 2025-05-07T20:32:22.0542710Z compiled: bool, 2025-05-07T20:32:22.0542786Z ) -> None: 2025-05-07T20:32:22.0542883Z torch.manual_seed(2025) 2025-05-07T20:32:22.0542952Z 2025-05-07T20:32:22.0543131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0543200Z 2025-05-07T20:32:22.0543289Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0543412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0543497Z x = x_sign * x_clamp 2025-05-07T20:32:22.0543573Z x0 = x[:, :D] 2025-05-07T20:32:22.0543651Z x1 = x[:, D:] 2025-05-07T20:32:22.0543720Z 2025-05-07T20:32:22.0543810Z if contiguous: 2025-05-07T20:32:22.0544092Z x0 = x0.contiguous() 2025-05-07T20:32:22.0544183Z x1 = x1.contiguous() 2025-05-07T20:32:22.0544251Z 2025-05-07T20:32:22.0544344Z if scale_ub is not None: 2025-05-07T20:32:22.0544447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0544584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0544656Z ) 2025-05-07T20:32:22.0544728Z else: 2025-05-07T20:32:22.0544823Z scale_ub_tensor = None 2025-05-07T20:32:22.0544891Z 2025-05-07T20:32:22.0545020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0545108Z op = silu_mul_quant 2025-05-07T20:32:22.0545229Z if compiled: 2025-05-07T20:32:22.0545325Z op = torch.compile(op) 2025-05-07T20:32:22.0545430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0545501Z 2025-05-07T20:32:22.0545661Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0545667Z 2025-05-07T20:32:22.0545835Z moe/activation_test.py:117: 2025-05-07T20:32:22.0545963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0546066Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0546166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0546533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0546630Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0547121Z test_silu_mul_quant (moe/activation_test.py) failed on every Hypothesis example with the same Triton CompilationError. Representative example and traceback below; object reprs were dropped in log capture and appear as <...>.

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
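For context on what fails to compile: silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel, which fuses SiLU(x0) * x1 with quantization to FP8 and also returns a scale. A rough eager-mode sketch of that computation is below; the helper name, the rowwise absmax scaling, and the e4m3 target are this note's assumptions for illustration, not FBGEMM's exact implementation.

    import torch

    FP8_DTYPE = torch.float8_e4m3fn       # what Triton calls "fp8e4nv"
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 before quantizing.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise absmax, optionally capped by scale_ub, maps each row
        # onto the representable FP8 range.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale).to(FP8_DTYPE)
        return y_fp8, scale.squeeze(-1)

On hardware where the Triton kernel compiles, a reference of this kind is the sort of oracle such a test can compare against; here compilation fails before any numerics run.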
Hypothesis then retried the remaining examples, and every one raised the identical CompilationError from triton/compiler/compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), with the same traceback as above:

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)

The captured log ends mid-traceback while retrying the last of these examples.
2025-05-07T20:32:22.0707717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0707820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0708177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0708398Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0708745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0708835Z kernel = self.compile( 2025-05-07T20:32:22.0709219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0709395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0709518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0709524Z 2025-05-07T20:32:22.0709776Z self = 2025-05-07T20:32:22.0710547Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0711045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2785e40>} 2025-05-07T20:32:22.0711795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0711982Z context = 2025-05-07T20:32:22.0711986Z 2025-05-07T20:32:22.0712213Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0712512Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0712617Z module_map=module_map) 2025-05-07T20:32:22.0712775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0712869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0712948Z E ^ 2025-05-07T20:32:22.0713298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0713303Z 2025-05-07T20:32:22.0713715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0713725Z 2025-05-07T20:32:22.0713824Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0714042Z self=, 2025-05-07T20:32:22.0714118Z T=2048, 2025-05-07T20:32:22.0714192Z D=5120, 2025-05-07T20:32:22.0714271Z scale_ub=None, 2025-05-07T20:32:22.0714402Z contiguous=False, 2025-05-07T20:32:22.0714481Z compiled=True, 2025-05-07T20:32:22.0714550Z ) 2025-05-07T20:32:22.0714769Z self = 2025-05-07T20:32:22.0714940Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:22.0714945Z 2025-05-07T20:32:22.0715025Z @given( 2025-05-07T20:32:22.0715139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0715231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0715343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0715453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0715562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0715636Z ) 2025-05-07T20:32:22.0715876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0715967Z def test_silu_mul_quant( 2025-05-07T20:32:22.0716047Z self, 2025-05-07T20:32:22.0716122Z T: int, 2025-05-07T20:32:22.0716192Z D: int, 2025-05-07T20:32:22.0716288Z scale_ub: Optional[float], 2025-05-07T20:32:22.0716371Z contiguous: bool, 2025-05-07T20:32:22.0716455Z compiled: bool, 2025-05-07T20:32:22.0716529Z ) -> None: 2025-05-07T20:32:22.0716620Z torch.manual_seed(2025) 2025-05-07T20:32:22.0716693Z 2025-05-07T20:32:22.0716858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0716929Z 2025-05-07T20:32:22.0717022Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0717139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0717223Z x = x_sign * x_clamp 2025-05-07T20:32:22.0717303Z x0 = x[:, :D] 2025-05-07T20:32:22.0717378Z x1 = x[:, D:] 2025-05-07T20:32:22.0717448Z 2025-05-07T20:32:22.0717533Z if contiguous: 2025-05-07T20:32:22.0717619Z x0 = x0.contiguous() 2025-05-07T20:32:22.0717759Z x1 = x1.contiguous() 2025-05-07T20:32:22.0717830Z 2025-05-07T20:32:22.0717915Z if scale_ub is not None: 2025-05-07T20:32:22.0718018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0718149Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0718220Z ) 2025-05-07T20:32:22.0718297Z else: 2025-05-07T20:32:22.0718386Z scale_ub_tensor = None 2025-05-07T20:32:22.0718454Z 2025-05-07T20:32:22.0718581Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0718665Z op = silu_mul_quant 2025-05-07T20:32:22.0718744Z if compiled: 2025-05-07T20:32:22.0718844Z op = torch.compile(op) 2025-05-07T20:32:22.0718944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0719010Z 2025-05-07T20:32:22.0719100Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0719146Z 2025-05-07T20:32:22.0719239Z moe/activation_test.py:117: 2025-05-07T20:32:22.0719410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0719506Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0719598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0719967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0720053Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0720563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0720672Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0721053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0721274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0721615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0721746Z kernel = self.compile( 2025-05-07T20:32:22.0722132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0722301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0722428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0722432Z 2025-05-07T20:32:22.0722633Z self = 2025-05-07T20:32:22.0723508Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0724006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2787240>} 2025-05-07T20:32:22.0724758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0724948Z context = 2025-05-07T20:32:22.0724952Z 2025-05-07T20:32:22.0725112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0725373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0725477Z module_map=module_map) 2025-05-07T20:32:22.0725634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0725730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0725802Z E ^ 2025-05-07T20:32:22.0726198Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0726207Z 2025-05-07T20:32:22.0726622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0726626Z 2025-05-07T20:32:22.0726724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0726948Z self=, 2025-05-07T20:32:22.0727019Z T=2048, 2025-05-07T20:32:22.0727091Z D=5120, 2025-05-07T20:32:22.0727170Z scale_ub=1200.0, 2025-05-07T20:32:22.0727249Z contiguous=False, 2025-05-07T20:32:22.0727329Z compiled=True, 2025-05-07T20:32:22.0727403Z ) 2025-05-07T20:32:22.0727616Z self = 2025-05-07T20:32:22.0727787Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0727791Z 2025-05-07T20:32:22.0727911Z @given( 2025-05-07T20:32:22.0728030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0728168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0728278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0728393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0728505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0728574Z ) 2025-05-07T20:32:22.0728814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0728903Z def test_silu_mul_quant( 2025-05-07T20:32:22.0728976Z self, 2025-05-07T20:32:22.0729047Z T: int, 2025-05-07T20:32:22.0729120Z D: int, 2025-05-07T20:32:22.0729212Z scale_ub: Optional[float], 2025-05-07T20:32:22.0729295Z contiguous: bool, 2025-05-07T20:32:22.0729378Z compiled: bool, 2025-05-07T20:32:22.0729448Z ) -> None: 2025-05-07T20:32:22.0729540Z torch.manual_seed(2025) 2025-05-07T20:32:22.0729613Z 2025-05-07T20:32:22.0729783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0729899Z 2025-05-07T20:32:22.0729986Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0730105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0730192Z x = x_sign * x_clamp 2025-05-07T20:32:22.0730267Z x0 = x[:, :D] 2025-05-07T20:32:22.0730340Z x1 = x[:, D:] 2025-05-07T20:32:22.0730411Z 2025-05-07T20:32:22.0730491Z if contiguous: 2025-05-07T20:32:22.0730576Z x0 = x0.contiguous() 2025-05-07T20:32:22.0730662Z x1 = x1.contiguous() 2025-05-07T20:32:22.0730730Z 2025-05-07T20:32:22.0730814Z if scale_ub is not None: 2025-05-07T20:32:22.0730917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0731046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0731121Z ) 2025-05-07T20:32:22.0731194Z else: 2025-05-07T20:32:22.0731284Z scale_ub_tensor = None 2025-05-07T20:32:22.0731360Z 2025-05-07T20:32:22.0731486Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0731571Z op = silu_mul_quant 2025-05-07T20:32:22.0731658Z if compiled: 2025-05-07T20:32:22.0731752Z op = torch.compile(op) 2025-05-07T20:32:22.0731850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0731923Z 2025-05-07T20:32:22.0732009Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0732014Z 2025-05-07T20:32:22.0732110Z moe/activation_test.py:117: 2025-05-07T20:32:22.0732234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0732329Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0732428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0732797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0732888Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0733432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0733527Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0733887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0734106Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0734446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0734538Z kernel = self.compile( 2025-05-07T20:32:22.0734919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0735087Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0735255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0735301Z 2025-05-07T20:32:22.0735506Z self = 2025-05-07T20:32:22.0736274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0736764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1afc720>} 2025-05-07T20:32:22.0737516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0737699Z context = 2025-05-07T20:32:22.0737706Z 2025-05-07T20:32:22.0737869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0738174Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0738277Z module_map=module_map) 2025-05-07T20:32:22.0738670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0738818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0738899Z E ^ 2025-05-07T20:32:22.0739256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0739261Z 2025-05-07T20:32:22.0739672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0739677Z 2025-05-07T20:32:22.0739773Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0740002Z self=, 2025-05-07T20:32:22.0740073Z T=4096, 2025-05-07T20:32:22.0740155Z D=5120, 2025-05-07T20:32:22.0740238Z scale_ub=1200.0, 2025-05-07T20:32:22.0740314Z contiguous=True, 2025-05-07T20:32:22.0740394Z compiled=True, 2025-05-07T20:32:22.0740465Z ) 2025-05-07T20:32:22.0740678Z self = 2025-05-07T20:32:22.0740847Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0740852Z 2025-05-07T20:32:22.0740923Z @given( 2025-05-07T20:32:22.0741040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0741136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0741245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0741362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0741470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0741539Z ) 2025-05-07T20:32:22.0741785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0742000Z def test_silu_mul_quant( 2025-05-07T20:32:22.0742073Z self, 2025-05-07T20:32:22.0742152Z T: int, 2025-05-07T20:32:22.0742226Z D: int, 2025-05-07T20:32:22.0742318Z scale_ub: Optional[float], 2025-05-07T20:32:22.0742408Z contiguous: bool, 2025-05-07T20:32:22.0742487Z compiled: bool, 2025-05-07T20:32:22.0742560Z ) -> None: 2025-05-07T20:32:22.0742655Z torch.manual_seed(2025) 2025-05-07T20:32:22.0742724Z 2025-05-07T20:32:22.0742890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0742961Z 2025-05-07T20:32:22.0743048Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0743170Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0743255Z x = x_sign * x_clamp 2025-05-07T20:32:22.0743328Z x0 = x[:, :D] 2025-05-07T20:32:22.0743404Z x1 = x[:, D:] 2025-05-07T20:32:22.0743532Z 2025-05-07T20:32:22.0743612Z if contiguous: 2025-05-07T20:32:22.0743763Z x0 = x0.contiguous() 2025-05-07T20:32:22.0743849Z x1 = x1.contiguous() 2025-05-07T20:32:22.0743917Z 2025-05-07T20:32:22.0744005Z if scale_ub is not None: 2025-05-07T20:32:22.0744104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0744234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0744305Z ) 2025-05-07T20:32:22.0744374Z else: 2025-05-07T20:32:22.0744469Z scale_ub_tensor = None 2025-05-07T20:32:22.0744536Z 2025-05-07T20:32:22.0744659Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0744748Z op = silu_mul_quant 2025-05-07T20:32:22.0744829Z if compiled: 2025-05-07T20:32:22.0744923Z op = torch.compile(op) 2025-05-07T20:32:22.0745028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0745096Z 2025-05-07T20:32:22.0745183Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0745191Z 2025-05-07T20:32:22.0745354Z moe/activation_test.py:117: 2025-05-07T20:32:22.0745478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0745578Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0745671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0746039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0746131Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0746623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0746714Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0747073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0747292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0747640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0747728Z kernel = self.compile( 2025-05-07T20:32:22.0748109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0748283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0748405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0748409Z 2025-05-07T20:32:22.0748613Z self = 2025-05-07T20:32:22.0749374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0749906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1afd260>} 2025-05-07T20:32:22.0750661Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0750847Z context = 2025-05-07T20:32:22.0750852Z 2025-05-07T20:32:22.0751014Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0751273Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0751371Z module_map=module_map) 2025-05-07T20:32:22.0751530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0751624Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0751737Z E ^ 2025-05-07T20:32:22.0752096Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0752140Z 2025-05-07T20:32:22.0752550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0752555Z 2025-05-07T20:32:22.0752655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0752872Z self=, 2025-05-07T20:32:22.0752946Z T=128, 2025-05-07T20:32:22.0753022Z D=5120, 2025-05-07T20:32:22.0753101Z scale_ub=1200.0, 2025-05-07T20:32:22.0753187Z contiguous=False, 2025-05-07T20:32:22.0753266Z compiled=True, 2025-05-07T20:32:22.0753337Z ) 2025-05-07T20:32:22.0753557Z self = 2025-05-07T20:32:22.0753729Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0753734Z 2025-05-07T20:32:22.0753812Z @given( 2025-05-07T20:32:22.0753984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0754079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0754191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0754312Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0754422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0754497Z ) 2025-05-07T20:32:22.0754745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0754835Z def test_silu_mul_quant( 2025-05-07T20:32:22.0754915Z self, 2025-05-07T20:32:22.0754991Z T: int, 2025-05-07T20:32:22.0755067Z D: int, 2025-05-07T20:32:22.0755170Z scale_ub: Optional[float], 2025-05-07T20:32:22.0755257Z contiguous: bool, 2025-05-07T20:32:22.0755338Z compiled: bool, 2025-05-07T20:32:22.0755421Z ) -> None: 2025-05-07T20:32:22.0755515Z torch.manual_seed(2025) 2025-05-07T20:32:22.0755594Z 2025-05-07T20:32:22.0755767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0755844Z 2025-05-07T20:32:22.0755940Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0756067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0756156Z x = x_sign * x_clamp 2025-05-07T20:32:22.0756239Z x0 = x[:, :D] 2025-05-07T20:32:22.0756317Z x1 = x[:, D:] 2025-05-07T20:32:22.0756388Z 2025-05-07T20:32:22.0756474Z if contiguous: 2025-05-07T20:32:22.0756561Z x0 = x0.contiguous() 2025-05-07T20:32:22.0756651Z x1 = x1.contiguous() 2025-05-07T20:32:22.0756724Z 2025-05-07T20:32:22.0756812Z if scale_ub is not None: 2025-05-07T20:32:22.0756914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0757053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0757132Z ) 2025-05-07T20:32:22.0757207Z else: 2025-05-07T20:32:22.0757358Z scale_ub_tensor = None 2025-05-07T20:32:22.0757430Z 2025-05-07T20:32:22.0757564Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0757651Z op = silu_mul_quant 2025-05-07T20:32:22.0757734Z if compiled: 2025-05-07T20:32:22.0757834Z op = torch.compile(op) 2025-05-07T20:32:22.0757940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0758010Z 2025-05-07T20:32:22.0758101Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0758106Z 2025-05-07T20:32:22.0758202Z moe/activation_test.py:117: 2025-05-07T20:32:22.0758329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0758433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0758529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0758947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0759077Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0759575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0759673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0760032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0760253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0760602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0760693Z kernel = self.compile( 2025-05-07T20:32:22.0761083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0761257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0761386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0761433Z 2025-05-07T20:32:22.0761646Z self = 2025-05-07T20:32:22.0762413Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0762913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1afe480>} 2025-05-07T20:32:22.0763766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0763967Z context = 2025-05-07T20:32:22.0763975Z 2025-05-07T20:32:22.0764146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0764405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0764514Z module_map=module_map) 2025-05-07T20:32:22.0764673Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0764766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0764848Z E ^ 2025-05-07T20:32:22.0765201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0765206Z 2025-05-07T20:32:22.0765627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0765631Z 2025-05-07T20:32:22.0765732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0765998Z self=, 2025-05-07T20:32:22.0766087Z T=16384, 2025-05-07T20:32:22.0766161Z D=7168, 2025-05-07T20:32:22.0766240Z scale_ub=1200.0, 2025-05-07T20:32:22.0766327Z contiguous=True, 2025-05-07T20:32:22.0766409Z compiled=True, 2025-05-07T20:32:22.0766486Z ) 2025-05-07T20:32:22.0766701Z self = 2025-05-07T20:32:22.0766872Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0766877Z 2025-05-07T20:32:22.0766958Z @given( 2025-05-07T20:32:22.0767074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0767169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0767285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0767398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0767508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0767625Z ) 2025-05-07T20:32:22.0767872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0768010Z def test_silu_mul_quant( 2025-05-07T20:32:22.0768084Z self, 2025-05-07T20:32:22.0768157Z T: int, 2025-05-07T20:32:22.0768239Z D: int, 2025-05-07T20:32:22.0768334Z scale_ub: Optional[float], 2025-05-07T20:32:22.0768419Z contiguous: bool, 2025-05-07T20:32:22.0768509Z compiled: bool, 2025-05-07T20:32:22.0768584Z ) -> None: 2025-05-07T20:32:22.0771939Z torch.manual_seed(2025) 2025-05-07T20:32:22.0772021Z 2025-05-07T20:32:22.0772202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0772274Z 2025-05-07T20:32:22.0772364Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0772489Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0772578Z x = x_sign * x_clamp 2025-05-07T20:32:22.0772657Z x0 = x[:, :D] 2025-05-07T20:32:22.0772737Z x1 = x[:, D:] 2025-05-07T20:32:22.0772905Z 2025-05-07T20:32:22.0772986Z if contiguous: 2025-05-07T20:32:22.0773080Z x0 = x0.contiguous() 2025-05-07T20:32:22.0773168Z x1 = x1.contiguous() 2025-05-07T20:32:22.0773238Z 2025-05-07T20:32:22.0773330Z if scale_ub is not None: 2025-05-07T20:32:22.0773431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0773567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0773640Z ) 2025-05-07T20:32:22.0773710Z else: 2025-05-07T20:32:22.0773805Z scale_ub_tensor = None 2025-05-07T20:32:22.0773874Z 2025-05-07T20:32:22.0774000Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0774090Z op = silu_mul_quant 2025-05-07T20:32:22.0774172Z if compiled: 2025-05-07T20:32:22.0774267Z op = torch.compile(op) 2025-05-07T20:32:22.0774376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0774447Z 2025-05-07T20:32:22.0774539Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0774547Z 2025-05-07T20:32:22.0774641Z moe/activation_test.py:117: 2025-05-07T20:32:22.0774768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0774868Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0774964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0775331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0775425Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0775916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0776012Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0776375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0776642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0776995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0777086Z kernel = self.compile( 2025-05-07T20:32:22.0777469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0777645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0777769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0777773Z 2025-05-07T20:32:22.0777977Z self = 2025-05-07T20:32:22.0778783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0779318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1affd80>} 2025-05-07T20:32:22.0780073Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0780260Z context = 2025-05-07T20:32:22.0780265Z 2025-05-07T20:32:22.0780432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0780691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0780794Z module_map=module_map) 2025-05-07T20:32:22.0780959Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0781057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0781141Z E ^ 2025-05-07T20:32:22.0781537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0781542Z 2025-05-07T20:32:22.0781954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0781959Z 2025-05-07T20:32:22.0782060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0782279Z self=, 2025-05-07T20:32:22.0782359Z T=16384, 2025-05-07T20:32:22.0782435Z D=5120, 2025-05-07T20:32:22.0782512Z scale_ub=1200.0, 2025-05-07T20:32:22.0782596Z contiguous=True, 2025-05-07T20:32:22.0782676Z compiled=False, 2025-05-07T20:32:22.0782753Z ) 2025-05-07T20:32:22.0782970Z self = 2025-05-07T20:32:22.0783150Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0783159Z 2025-05-07T20:32:22.0783231Z @given( 2025-05-07T20:32:22.0783351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0783449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0783561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0783677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0783786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0783860Z ) 2025-05-07T20:32:22.0784104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0784193Z def test_silu_mul_quant( 2025-05-07T20:32:22.0784266Z self, 2025-05-07T20:32:22.0784338Z T: int, 2025-05-07T20:32:22.0784412Z D: int, 2025-05-07T20:32:22.0784508Z scale_ub: Optional[float], 2025-05-07T20:32:22.0784592Z contiguous: bool, 2025-05-07T20:32:22.0784674Z compiled: bool, 2025-05-07T20:32:22.0784758Z ) -> None: 2025-05-07T20:32:22.0784897Z torch.manual_seed(2025) 2025-05-07T20:32:22.0784969Z 2025-05-07T20:32:22.0785140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0785210Z 2025-05-07T20:32:22.0785299Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0785418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0785506Z x = x_sign * x_clamp 2025-05-07T20:32:22.0785583Z x0 = x[:, :D] 2025-05-07T20:32:22.0785658Z x1 = x[:, D:] 2025-05-07T20:32:22.0785727Z 2025-05-07T20:32:22.0785810Z if contiguous: 2025-05-07T20:32:22.0785898Z x0 = x0.contiguous() 2025-05-07T20:32:22.0785984Z x1 = x1.contiguous() 2025-05-07T20:32:22.0786058Z 2025-05-07T20:32:22.0786143Z if scale_ub is not None: 2025-05-07T20:32:22.0786246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0786423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0786537Z ) 2025-05-07T20:32:22.0786613Z else: 2025-05-07T20:32:22.0786704Z scale_ub_tensor = None 2025-05-07T20:32:22.0786772Z 2025-05-07T20:32:22.0786902Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0786990Z op = silu_mul_quant 2025-05-07T20:32:22.0787072Z if compiled: 2025-05-07T20:32:22.0787168Z op = torch.compile(op) 2025-05-07T20:32:22.0787270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0787337Z 2025-05-07T20:32:22.0787426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0787431Z 2025-05-07T20:32:22.0787525Z moe/activation_test.py:117: 2025-05-07T20:32:22.0787653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0787747Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0787844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0788349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:22.0788487Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0788847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0789071Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0789409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0789505Z kernel = self.compile( 2025-05-07T20:32:22.0789889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0790061Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0790188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0790195Z 2025-05-07T20:32:22.0790401Z self = 2025-05-07T20:32:22.0791214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0791726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1774cc0>} 2025-05-07T20:32:22.0792473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0792667Z context = 2025-05-07T20:32:22.0792671Z 2025-05-07T20:32:22.0792835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0793141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0793249Z module_map=module_map) 2025-05-07T20:32:22.0793407Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0793506Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0793582Z E ^ 2025-05-07T20:32:22.0793933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0793941Z 2025-05-07T20:32:22.0794353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0794358Z 2025-05-07T20:32:22.0794454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0794675Z self=, 2025-05-07T20:32:22.0794747Z T=1, 2025-05-07T20:32:22.0794861Z D=7168, 2025-05-07T20:32:22.0794986Z scale_ub=1200.0, 2025-05-07T20:32:22.0795073Z contiguous=False, 2025-05-07T20:32:22.0795156Z compiled=False, 2025-05-07T20:32:22.0795230Z ) 2025-05-07T20:32:22.0795446Z self = 2025-05-07T20:32:22.0795618Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0795622Z 2025-05-07T20:32:22.0795696Z @given( 2025-05-07T20:32:22.0795813Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0795909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0796020Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0796132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0796248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0796317Z ) 2025-05-07T20:32:22.0796564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0796661Z def test_silu_mul_quant( 2025-05-07T20:32:22.0796782Z self, 2025-05-07T20:32:22.0796858Z T: int, 2025-05-07T20:32:22.0796932Z D: int, 2025-05-07T20:32:22.0797025Z scale_ub: Optional[float], 2025-05-07T20:32:22.0797112Z contiguous: bool, 2025-05-07T20:32:22.0797194Z compiled: bool, 2025-05-07T20:32:22.0797270Z ) -> None: 2025-05-07T20:32:22.0797361Z torch.manual_seed(2025) 2025-05-07T20:32:22.0797438Z 2025-05-07T20:32:22.0797607Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0797678Z 2025-05-07T20:32:22.0797769Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0797889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0797978Z x = x_sign * x_clamp 2025-05-07T20:32:22.0798053Z x0 = x[:, :D] 2025-05-07T20:32:22.0798128Z x1 = x[:, D:] 2025-05-07T20:32:22.0798202Z 2025-05-07T20:32:22.0798283Z if contiguous: 2025-05-07T20:32:22.0798372Z x0 = x0.contiguous() 2025-05-07T20:32:22.0798464Z x1 = x1.contiguous() 2025-05-07T20:32:22.0798533Z 2025-05-07T20:32:22.0798622Z if scale_ub is not None: 2025-05-07T20:32:22.0798726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0798855Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0798929Z ) 2025-05-07T20:32:22.0799005Z else: 2025-05-07T20:32:22.0799095Z scale_ub_tensor = None 2025-05-07T20:32:22.0799160Z 2025-05-07T20:32:22.0799291Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0799377Z op = silu_mul_quant 2025-05-07T20:32:22.0799461Z if compiled: 2025-05-07T20:32:22.0799556Z op = torch.compile(op) 2025-05-07T20:32:22.0799656Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0799729Z 2025-05-07T20:32:22.0799816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0799823Z 2025-05-07T20:32:22.0799960Z moe/activation_test.py:117: 2025-05-07T20:32:22.0800094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0800191Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0800286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0800786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0800884Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0801247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0801465Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0801803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0801895Z kernel = self.compile( 2025-05-07T20:32:22.0802317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0802554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0802676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0802681Z 2025-05-07T20:32:22.0802884Z self = 2025-05-07T20:32:22.0803742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0804235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1775080>} 2025-05-07T20:32:22.0804992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0805221Z context = 2025-05-07T20:32:22.0805226Z 2025-05-07T20:32:22.0805387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0805649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0805750Z module_map=module_map) 2025-05-07T20:32:22.0805912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0806005Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0806077Z E ^ 2025-05-07T20:32:22.0806431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0806436Z 2025-05-07T20:32:22.0806854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0806865Z 2025-05-07T20:32:22.0806966Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0807186Z self=, 2025-05-07T20:32:22.0807259Z T=4096, 2025-05-07T20:32:22.0807334Z D=7168, 2025-05-07T20:32:22.0807412Z scale_ub=1200.0, 2025-05-07T20:32:22.0807491Z contiguous=False, 2025-05-07T20:32:22.0807568Z compiled=True, 2025-05-07T20:32:22.0807635Z ) 2025-05-07T20:32:22.0807848Z self = 2025-05-07T20:32:22.0808020Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0808025Z 2025-05-07T20:32:22.0808097Z @given( 2025-05-07T20:32:22.0808217Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0808311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0808425Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0808584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0808696Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0808768Z ) 2025-05-07T20:32:22.0809013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0809101Z def test_silu_mul_quant( 2025-05-07T20:32:22.0809170Z self, 2025-05-07T20:32:22.0809247Z T: int, 2025-05-07T20:32:22.0809317Z D: int, 2025-05-07T20:32:22.0809411Z scale_ub: Optional[float], 2025-05-07T20:32:22.0809495Z contiguous: bool, 2025-05-07T20:32:22.0809574Z compiled: bool, 2025-05-07T20:32:22.0809648Z ) -> None: 2025-05-07T20:32:22.0809736Z torch.manual_seed(2025) 2025-05-07T20:32:22.0809803Z 2025-05-07T20:32:22.0809971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0810042Z 2025-05-07T20:32:22.0810172Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0810339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0810426Z x = x_sign * x_clamp 2025-05-07T20:32:22.0810500Z x0 = x[:, :D] 2025-05-07T20:32:22.0810578Z x1 = x[:, D:] 2025-05-07T20:32:22.0810646Z 2025-05-07T20:32:22.0810723Z if contiguous: 2025-05-07T20:32:22.0810810Z x0 = x0.contiguous() 2025-05-07T20:32:22.0810894Z x1 = x1.contiguous() 2025-05-07T20:32:22.0810965Z 2025-05-07T20:32:22.0811051Z if scale_ub is not None: 2025-05-07T20:32:22.0811150Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0811283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0811353Z ) 2025-05-07T20:32:22.0811425Z else: 2025-05-07T20:32:22.0811518Z scale_ub_tensor = None 2025-05-07T20:32:22.0811585Z 2025-05-07T20:32:22.0811711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0811802Z op = silu_mul_quant 2025-05-07T20:32:22.0811887Z if compiled: 2025-05-07T20:32:22.0812097Z op = torch.compile(op) 2025-05-07T20:32:22.0812197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0812265Z 2025-05-07T20:32:22.0812356Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0812360Z 2025-05-07T20:32:22.0812452Z moe/activation_test.py:117: 2025-05-07T20:32:22.0812577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0812674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0812770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0813136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0813227Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0813726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0813825Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0814188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0814407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0814748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0814835Z kernel = self.compile( 2025-05-07T20:32:22.0815220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0815395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0815517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0815521Z 2025-05-07T20:32:22.0815726Z self = 2025-05-07T20:32:22.0816540Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0817039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1777060>} 2025-05-07T20:32:22.0817790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0817977Z context = 2025-05-07T20:32:22.0817981Z 2025-05-07T20:32:22.0818145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0818440Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0818586Z module_map=module_map) 2025-05-07T20:32:22.0818745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0818838Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0818913Z E ^ 2025-05-07T20:32:22.0819263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0819268Z 2025-05-07T20:32:22.0819678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0819683Z 2025-05-07T20:32:22.0819785Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0820001Z self=, 2025-05-07T20:32:22.0820077Z T=128, 2025-05-07T20:32:22.0820150Z D=7168, 2025-05-07T20:32:22.0820225Z scale_ub=1200.0, 2025-05-07T20:32:22.0820311Z contiguous=False, 2025-05-07T20:32:22.0820388Z compiled=True, 2025-05-07T20:32:22.0820465Z ) 2025-05-07T20:32:22.0820747Z self = 2025-05-07T20:32:22.0820937Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0820942Z 2025-05-07T20:32:22.0821015Z @given( 2025-05-07T20:32:22.0821131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0821223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0821338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0821451Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0821559Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0821629Z ) 2025-05-07T20:32:22.0821868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0821958Z def test_silu_mul_quant( 2025-05-07T20:32:22.0822033Z self, 2025-05-07T20:32:22.0822107Z T: int, 2025-05-07T20:32:22.0822182Z D: int, 2025-05-07T20:32:22.0822282Z scale_ub: Optional[float], 2025-05-07T20:32:22.0822367Z contiguous: bool, 2025-05-07T20:32:22.0822446Z compiled: bool, 2025-05-07T20:32:22.0822523Z ) -> None: 2025-05-07T20:32:22.0822612Z torch.manual_seed(2025) 2025-05-07T20:32:22.0822685Z 2025-05-07T20:32:22.0822849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0822919Z 2025-05-07T20:32:22.0823006Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0823125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0823206Z x = x_sign * x_clamp 2025-05-07T20:32:22.0823287Z x0 = x[:, :D] 2025-05-07T20:32:22.0823360Z x1 = x[:, D:] 2025-05-07T20:32:22.0823428Z 2025-05-07T20:32:22.0823510Z if contiguous: 2025-05-07T20:32:22.0823595Z x0 = x0.contiguous() 2025-05-07T20:32:22.0823683Z x1 = x1.contiguous() 2025-05-07T20:32:22.0823751Z 2025-05-07T20:32:22.0823884Z if scale_ub is not None: 2025-05-07T20:32:22.0823993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0824123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0824193Z ) 2025-05-07T20:32:22.0824267Z else: 2025-05-07T20:32:22.0824356Z scale_ub_tensor = None 2025-05-07T20:32:22.0824423Z 2025-05-07T20:32:22.0824551Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0824637Z op = silu_mul_quant 2025-05-07T20:32:22.0824715Z if compiled: 2025-05-07T20:32:22.0824815Z op = torch.compile(op) 2025-05-07T20:32:22.0824915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0824983Z 2025-05-07T20:32:22.0825072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0825076Z 2025-05-07T20:32:22.0825167Z moe/activation_test.py:117: 2025-05-07T20:32:22.0825333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0825475Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0825570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0825940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0826028Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0826522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0826617Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0826979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0827202Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0827541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0827629Z kernel = self.compile( 2025-05-07T20:32:22.0828019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0828233Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0828357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0828365Z 2025-05-07T20:32:22.0828568Z self = 2025-05-07T20:32:22.0829335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0829834Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f188c360>} 2025-05-07T20:32:22.0830583Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0830778Z context = 2025-05-07T20:32:22.0830782Z 2025-05-07T20:32:22.0830947Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0831204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0831308Z module_map=module_map) 2025-05-07T20:32:22.0831468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0831561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0831636Z E ^ 2025-05-07T20:32:22.0831988Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0831995Z 2025-05-07T20:32:22.0832451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0832460Z 2025-05-07T20:32:22.0832559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0832778Z self=, 2025-05-07T20:32:22.0832854Z T=2048, 2025-05-07T20:32:22.0832927Z D=7168, 2025-05-07T20:32:22.0833003Z scale_ub=None, 2025-05-07T20:32:22.0833083Z contiguous=True, 2025-05-07T20:32:22.0833156Z compiled=True, 2025-05-07T20:32:22.0833224Z ) 2025-05-07T20:32:22.0833448Z self = 2025-05-07T20:32:22.0833612Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0833617Z 2025-05-07T20:32:22.0833693Z @given( 2025-05-07T20:32:22.0833807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0833966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0834123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0834236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0834344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0834415Z ) 2025-05-07T20:32:22.0834655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0834744Z def test_silu_mul_quant( 2025-05-07T20:32:22.0834820Z self, 2025-05-07T20:32:22.0834893Z T: int, 2025-05-07T20:32:22.0834971Z D: int, 2025-05-07T20:32:22.0835064Z scale_ub: Optional[float], 2025-05-07T20:32:22.0835147Z contiguous: bool, 2025-05-07T20:32:22.0835229Z compiled: bool, 2025-05-07T20:32:22.0835302Z ) -> None: 2025-05-07T20:32:22.0835392Z torch.manual_seed(2025) 2025-05-07T20:32:22.0835464Z 2025-05-07T20:32:22.0835632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0835702Z 2025-05-07T20:32:22.0835798Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0835961Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0836043Z x = x_sign * x_clamp 2025-05-07T20:32:22.0836121Z x0 = x[:, :D] 2025-05-07T20:32:22.0836197Z x1 = x[:, D:] 2025-05-07T20:32:22.0836265Z 2025-05-07T20:32:22.0836346Z if contiguous: 2025-05-07T20:32:22.0836430Z x0 = x0.contiguous() 2025-05-07T20:32:22.0836514Z x1 = x1.contiguous() 2025-05-07T20:32:22.0836579Z 2025-05-07T20:32:22.0836663Z if scale_ub is not None: 2025-05-07T20:32:22.0836767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0836894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0836965Z ) 2025-05-07T20:32:22.0837040Z else: 2025-05-07T20:32:22.0837129Z scale_ub_tensor = None 2025-05-07T20:32:22.0837196Z 2025-05-07T20:32:22.0837326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0837418Z op = silu_mul_quant 2025-05-07T20:32:22.0837498Z if compiled: 2025-05-07T20:32:22.0837601Z op = torch.compile(op) 2025-05-07T20:32:22.0837701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0837772Z 2025-05-07T20:32:22.0837859Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0837863Z 2025-05-07T20:32:22.0837954Z moe/activation_test.py:117: 2025-05-07T20:32:22.0838081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0838177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0838270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0838915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0839009Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0839601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0839704Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0840062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0840284Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0840622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0840710Z kernel = self.compile( 2025-05-07T20:32:22.0841093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0841264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0841388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0841393Z 2025-05-07T20:32:22.0841654Z self = 2025-05-07T20:32:22.0842477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0842971Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f188cea0>} 2025-05-07T20:32:22.0843799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0843988Z context = 2025-05-07T20:32:22.0843993Z 2025-05-07T20:32:22.0844152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0844417Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0844586Z module_map=module_map) 2025-05-07T20:32:22.0844743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0844840Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0844913Z E ^ 2025-05-07T20:32:22.0845262Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0845267Z 2025-05-07T20:32:22.0845681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0845686Z 2025-05-07T20:32:22.0845783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0846007Z self=, 2025-05-07T20:32:22.0846081Z T=16384, 2025-05-07T20:32:22.0846154Z D=5120, 2025-05-07T20:32:22.0846238Z scale_ub=None, 2025-05-07T20:32:22.0846319Z contiguous=False, 2025-05-07T20:32:22.0846405Z compiled=False, 2025-05-07T20:32:22.0846477Z ) 2025-05-07T20:32:22.0846689Z self = 2025-05-07T20:32:22.0846859Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0846863Z 2025-05-07T20:32:22.0846938Z @given( 2025-05-07T20:32:22.0847051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0847143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0847251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0847361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0847470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0847539Z ) 2025-05-07T20:32:22.0847782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0847876Z def test_silu_mul_quant( 2025-05-07T20:32:22.0847946Z self, 2025-05-07T20:32:22.0848071Z T: int, 2025-05-07T20:32:22.0848149Z D: int, 2025-05-07T20:32:22.0848241Z scale_ub: Optional[float], 2025-05-07T20:32:22.0848326Z contiguous: bool, 2025-05-07T20:32:22.0848407Z compiled: bool, 2025-05-07T20:32:22.0848481Z ) -> None: 2025-05-07T20:32:22.0848573Z torch.manual_seed(2025) 2025-05-07T20:32:22.0848643Z 2025-05-07T20:32:22.0848808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0848877Z 2025-05-07T20:32:22.0848963Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0849081Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0850933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
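
The CompilationError repeated above is an architecture mismatch rather than a bug in the test body: Triton's fp8e4nv type (the NVIDIA e4m3 float8 variant) is only lowered on GPUs of compute capability 8.9 or newer, and the error's supported list ('fp8e4b15', 'fp8e5') is exactly the pre-8.9 set, so the GPU in this job cannot run this kernel at all. A minimal guard along these lines would skip rather than fail; the marker name is hypothetical and not code from this repository, but torch.cuda.get_device_capability() is the standard PyTorch API for this check:

import pytest
import torch

# Skip FP8 kernels on GPUs older than SM 8.9, where Triton cannot lower
# fp8e4nv and raises exactly the ValueError seen in this log.
requires_sm89 = pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv (float8 e4m3) requires compute capability >= 8.9",
)
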
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0850976Z 2025-05-07T20:32:22.0851092Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.0851097Z 2025-05-07T20:32:22.0851197Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0851415Z self=, 2025-05-07T20:32:22.0851491Z T=4096, 2025-05-07T20:32:22.0851564Z D=7168, 2025-05-07T20:32:22.0851641Z scale_ub=1200.0, 2025-05-07T20:32:22.0851723Z contiguous=True, 2025-05-07T20:32:22.0851799Z compiled=True, 2025-05-07T20:32:22.0851868Z ) 2025-05-07T20:32:22.0852083Z self = 2025-05-07T20:32:22.0852252Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0852305Z 2025-05-07T20:32:22.0852379Z @given( 2025-05-07T20:32:22.0852493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0852586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0852693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0852804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0852910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0852982Z ) 2025-05-07T20:32:22.0853220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0853308Z def test_silu_mul_quant( 2025-05-07T20:32:22.0853385Z self, 2025-05-07T20:32:22.0853459Z T: int, 2025-05-07T20:32:22.0853528Z D: int, 2025-05-07T20:32:22.0853622Z scale_ub: Optional[float], 2025-05-07T20:32:22.0853705Z contiguous: bool, 2025-05-07T20:32:22.0853788Z compiled: bool, 2025-05-07T20:32:22.0853865Z ) -> None: 2025-05-07T20:32:22.0853962Z torch.manual_seed(2025) 2025-05-07T20:32:22.0854030Z 2025-05-07T20:32:22.0854198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0854266Z 2025-05-07T20:32:22.0854356Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0854476Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0856258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0856267Z 2025-05-07T20:32:22.0856424Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.0856432Z 2025-05-07T20:32:22.0856530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0856752Z self=, 2025-05-07T20:32:22.0856825Z T=16384, 2025-05-07T20:32:22.0856898Z D=7168, 2025-05-07T20:32:22.0856977Z scale_ub=None, 2025-05-07T20:32:22.0857060Z contiguous=False, 2025-05-07T20:32:22.0857138Z compiled=False, 2025-05-07T20:32:22.0857211Z ) 2025-05-07T20:32:22.0857420Z self = 2025-05-07T20:32:22.0857594Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0857598Z 2025-05-07T20:32:22.0857673Z @given( 2025-05-07T20:32:22.0857787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0857921Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0858033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0858182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0858292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0858361Z ) 2025-05-07T20:32:22.0858604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0858692Z def test_silu_mul_quant( 2025-05-07T20:32:22.0858765Z self, 2025-05-07T20:32:22.0858839Z T: int, 2025-05-07T20:32:22.0858912Z D: int, 2025-05-07T20:32:22.0859001Z scale_ub: Optional[float], 2025-05-07T20:32:22.0859086Z contiguous: bool, 2025-05-07T20:32:22.0859167Z compiled: bool, 2025-05-07T20:32:22.0859239Z ) -> None: 2025-05-07T20:32:22.0859328Z torch.manual_seed(2025) 2025-05-07T20:32:22.0859396Z 2025-05-07T20:32:22.0859559Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0861348Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0861397Z 2025-05-07T20:32:22.0861512Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0861519Z 2025-05-07T20:32:22.0861620Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0861838Z self=, 2025-05-07T20:32:22.0861914Z T=2048, 2025-05-07T20:32:22.0861987Z D=7168, 2025-05-07T20:32:22.0862066Z scale_ub=1200.0, 2025-05-07T20:32:22.0862150Z contiguous=True, 2025-05-07T20:32:22.0862234Z compiled=True, 2025-05-07T20:32:22.0862308Z ) 2025-05-07T20:32:22.0862523Z self = 2025-05-07T20:32:22.0862687Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0862692Z 2025-05-07T20:32:22.0862765Z @given( 2025-05-07T20:32:22.0862884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0862979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0863091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0863201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0863307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0863381Z ) 2025-05-07T20:32:22.0863619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0863706Z def test_silu_mul_quant( 2025-05-07T20:32:22.0863784Z self, 2025-05-07T20:32:22.0863858Z T: int, 2025-05-07T20:32:22.0863998Z D: int, 2025-05-07T20:32:22.0864094Z scale_ub: Optional[float], 2025-05-07T20:32:22.0864176Z contiguous: bool, 2025-05-07T20:32:22.0864258Z compiled: bool, 2025-05-07T20:32:22.0864329Z ) -> None: 2025-05-07T20:32:22.0864418Z torch.manual_seed(2025) 2025-05-07T20:32:22.0864486Z 2025-05-07T20:32:22.0864651Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0864721Z 2025-05-07T20:32:22.0864808Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0864927Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0866730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
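
The OOM messages themselves suggest PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. One detail worth noting: the caching allocator reads that variable when it first initializes, so it has to be in the process environment before the first CUDA allocation, not set from inside a test. A sketch, assuming the suite's entry point can be edited:

import os

# Must be set before torch performs its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported afterwards; the first CUDA use picks the setting up

That said, with only ~28 MiB free against 21.6+ GiB already allocated, the failures here look driven by memory accumulating across examples rather than by fragmentation alone, so this hint may not be sufficient on its own.
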
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0866778Z 2025-05-07T20:32:22.0866889Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.0866894Z 2025-05-07T20:32:22.0866990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0867208Z self=, 2025-05-07T20:32:22.0867280Z T=2048, 2025-05-07T20:32:22.0867356Z D=7168, 2025-05-07T20:32:22.0867433Z scale_ub=None, 2025-05-07T20:32:22.0867514Z contiguous=True, 2025-05-07T20:32:22.0867598Z compiled=False, 2025-05-07T20:32:22.0867668Z ) 2025-05-07T20:32:22.0867877Z self = 2025-05-07T20:32:22.0868049Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0868056Z 2025-05-07T20:32:22.0868175Z @given( 2025-05-07T20:32:22.0868287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0868383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0868490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0868599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0868709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0868778Z ) 2025-05-07T20:32:22.0869019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0869107Z def test_silu_mul_quant( 2025-05-07T20:32:22.0869181Z self, 2025-05-07T20:32:22.0869254Z T: int, 2025-05-07T20:32:22.0869325Z D: int, 2025-05-07T20:32:22.0869417Z scale_ub: Optional[float], 2025-05-07T20:32:22.0869504Z contiguous: bool, 2025-05-07T20:32:22.0869584Z compiled: bool, 2025-05-07T20:32:22.0869658Z ) -> None: 2025-05-07T20:32:22.0869755Z torch.manual_seed(2025) 2025-05-07T20:32:22.0869826Z 2025-05-07T20:32:22.0869989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0870061Z 2025-05-07T20:32:22.0870147Z > x_sign = torch.sign(x) 2025-05-07T20:32:22.0871911Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0871918Z 2025-05-07T20:32:22.0872032Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:22.0872036Z 2025-05-07T20:32:22.0872182Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0872403Z self=, 2025-05-07T20:32:22.0872475Z T=1, 2025-05-07T20:32:22.0872550Z D=7168, 2025-05-07T20:32:22.0872628Z scale_ub=1200.0, 2025-05-07T20:32:22.0872709Z contiguous=True, 2025-05-07T20:32:22.0872790Z compiled=False, 2025-05-07T20:32:22.0872860Z ) 2025-05-07T20:32:22.0873070Z self = 2025-05-07T20:32:22.0873233Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0873237Z 2025-05-07T20:32:22.0873310Z @given( 2025-05-07T20:32:22.0873424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0873518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0873625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0873780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0873929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0874003Z ) 2025-05-07T20:32:22.0874244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0874332Z def test_silu_mul_quant( 2025-05-07T20:32:22.0874407Z self, 2025-05-07T20:32:22.0874479Z T: int, 2025-05-07T20:32:22.0874549Z D: int, 2025-05-07T20:32:22.0874651Z scale_ub: Optional[float], 2025-05-07T20:32:22.0874734Z contiguous: bool, 2025-05-07T20:32:22.0874812Z compiled: bool, 2025-05-07T20:32:22.0874889Z ) -> None: 2025-05-07T20:32:22.0874979Z torch.manual_seed(2025) 2025-05-07T20:32:22.0875048Z 2025-05-07T20:32:22.0875217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0875289Z 2025-05-07T20:32:22.0875375Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0875501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0875590Z x = x_sign * x_clamp 2025-05-07T20:32:22.0875711Z x0 = x[:, :D] 2025-05-07T20:32:22.0875787Z x1 = x[:, D:] 2025-05-07T20:32:22.0875855Z 2025-05-07T20:32:22.0875939Z if contiguous: 2025-05-07T20:32:22.0876027Z x0 = x0.contiguous() 2025-05-07T20:32:22.0876115Z x1 = x1.contiguous() 2025-05-07T20:32:22.0876190Z 2025-05-07T20:32:22.0876277Z if scale_ub is not None: 2025-05-07T20:32:22.0876375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0876509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0876581Z ) 2025-05-07T20:32:22.0876652Z else: 2025-05-07T20:32:22.0876743Z scale_ub_tensor = None 2025-05-07T20:32:22.0876811Z 2025-05-07T20:32:22.0876936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0877025Z op = silu_mul_quant 2025-05-07T20:32:22.0877107Z if compiled: 2025-05-07T20:32:22.0877205Z op = torch.compile(op) 2025-05-07T20:32:22.0877310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0877379Z 2025-05-07T20:32:22.0877466Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0877470Z 2025-05-07T20:32:22.0877563Z moe/activation_test.py:117: 2025-05-07T20:32:22.0877687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0877782Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0877877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0878378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0878473Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0878831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0879058Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0879445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0879538Z kernel = self.compile( 2025-05-07T20:32:22.0879924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0880095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0880217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0880222Z 2025-05-07T20:32:22.0880423Z self = 2025-05-07T20:32:22.0881238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0881782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1644680>} 2025-05-07T20:32:22.0882573Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0882763Z context = 2025-05-07T20:32:22.0882768Z 2025-05-07T20:32:22.0882928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0883188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0883291Z module_map=module_map) 2025-05-07T20:32:22.0883547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0883643Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0883716Z E ^ 2025-05-07T20:32:22.0884070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0884124Z 2025-05-07T20:32:22.0884542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0884546Z 2025-05-07T20:32:22.0884647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0884869Z self=, 2025-05-07T20:32:22.0884943Z T=128, 2025-05-07T20:32:22.0885013Z D=5120, 2025-05-07T20:32:22.0885092Z scale_ub=None, 2025-05-07T20:32:22.0885170Z contiguous=True, 2025-05-07T20:32:22.0885249Z compiled=False, 2025-05-07T20:32:22.0885322Z ) 2025-05-07T20:32:22.0885535Z self = 2025-05-07T20:32:22.0885706Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0885714Z 2025-05-07T20:32:22.0885789Z @given( 2025-05-07T20:32:22.0885911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0886002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0886116Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0886229Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0886337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0886408Z ) 2025-05-07T20:32:22.0886648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0886739Z def test_silu_mul_quant( 2025-05-07T20:32:22.0886815Z self, 2025-05-07T20:32:22.0886886Z T: int, 2025-05-07T20:32:22.0886961Z D: int, 2025-05-07T20:32:22.0887053Z scale_ub: Optional[float], 2025-05-07T20:32:22.0887136Z contiguous: bool, 2025-05-07T20:32:22.0887219Z compiled: bool, 2025-05-07T20:32:22.0887292Z ) -> None: 2025-05-07T20:32:22.0887386Z torch.manual_seed(2025) 2025-05-07T20:32:22.0887509Z 2025-05-07T20:32:22.0887678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0887751Z 2025-05-07T20:32:22.0887836Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0887956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0888043Z x = x_sign * x_clamp 2025-05-07T20:32:22.0888118Z x0 = x[:, :D] 2025-05-07T20:32:22.0888191Z x1 = x[:, D:] 2025-05-07T20:32:22.0888258Z 2025-05-07T20:32:22.0888337Z if contiguous: 2025-05-07T20:32:22.0891650Z x0 = x0.contiguous() 2025-05-07T20:32:22.0891750Z x1 = x1.contiguous() 2025-05-07T20:32:22.0891824Z 2025-05-07T20:32:22.0891914Z if scale_ub is not None: 2025-05-07T20:32:22.0892017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0892153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0892295Z ) 2025-05-07T20:32:22.0892369Z else: 2025-05-07T20:32:22.0892506Z scale_ub_tensor = None 2025-05-07T20:32:22.0892577Z 2025-05-07T20:32:22.0892706Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0892796Z op = silu_mul_quant 2025-05-07T20:32:22.0892881Z if compiled: 2025-05-07T20:32:22.0892983Z op = torch.compile(op) 2025-05-07T20:32:22.0893085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0893155Z 2025-05-07T20:32:22.0893249Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0893254Z 2025-05-07T20:32:22.0893348Z moe/activation_test.py:117: 2025-05-07T20:32:22.0893476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0893579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0893676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0894178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0894351Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0894709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0894933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0895274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0895365Z kernel = self.compile( 2025-05-07T20:32:22.0895754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0895924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0896053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0896058Z 2025-05-07T20:32:22.0896261Z self = 2025-05-07T20:32:22.0897035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0897534Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16458a0>} 2025-05-07T20:32:22.0898279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0898470Z context = 2025-05-07T20:32:22.0898475Z 2025-05-07T20:32:22.0898636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0898945Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0899056Z module_map=module_map) 2025-05-07T20:32:22.0899217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0899315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0899388Z E ^ 2025-05-07T20:32:22.0899740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0899745Z 2025-05-07T20:32:22.0900157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0900162Z 2025-05-07T20:32:22.0900260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0900480Z self=, 2025-05-07T20:32:22.0900556Z T=128, 2025-05-07T20:32:22.0900630Z D=7168, 2025-05-07T20:32:22.0900752Z scale_ub=None, 2025-05-07T20:32:22.0900837Z contiguous=True, 2025-05-07T20:32:22.0900960Z compiled=False, 2025-05-07T20:32:22.0901037Z ) 2025-05-07T20:32:22.0901249Z self = 2025-05-07T20:32:22.0901415Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0901419Z 2025-05-07T20:32:22.0901494Z @given( 2025-05-07T20:32:22.0901611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0901708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0901818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0901929Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0902041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0902109Z ) 2025-05-07T20:32:22.0902349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0902441Z def test_silu_mul_quant( 2025-05-07T20:32:22.0902517Z self, 2025-05-07T20:32:22.0902590Z T: int, 2025-05-07T20:32:22.0902718Z D: int, 2025-05-07T20:32:22.0902813Z scale_ub: Optional[float], 2025-05-07T20:32:22.0902897Z contiguous: bool, 2025-05-07T20:32:22.0902983Z compiled: bool, 2025-05-07T20:32:22.0903058Z ) -> None: 2025-05-07T20:32:22.0903154Z torch.manual_seed(2025) 2025-05-07T20:32:22.0903225Z 2025-05-07T20:32:22.0903391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0903464Z 2025-05-07T20:32:22.0903553Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0903673Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0903761Z x = x_sign * x_clamp 2025-05-07T20:32:22.0903836Z x0 = x[:, :D] 2025-05-07T20:32:22.0903909Z x1 = x[:, D:] 2025-05-07T20:32:22.0903983Z 2025-05-07T20:32:22.0904061Z if contiguous: 2025-05-07T20:32:22.0904153Z x0 = x0.contiguous() 2025-05-07T20:32:22.0904247Z x1 = x1.contiguous() 2025-05-07T20:32:22.0904321Z 2025-05-07T20:32:22.0904412Z if scale_ub is not None: 2025-05-07T20:32:22.0904519Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0904648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0904723Z ) 2025-05-07T20:32:22.0904797Z else: 2025-05-07T20:32:22.0904888Z scale_ub_tensor = None 2025-05-07T20:32:22.0904961Z 2025-05-07T20:32:22.0905086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0905173Z op = silu_mul_quant 2025-05-07T20:32:22.0905260Z if compiled: 2025-05-07T20:32:22.0905355Z op = torch.compile(op) 2025-05-07T20:32:22.0905455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0905532Z 2025-05-07T20:32:22.0905619Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0905624Z 2025-05-07T20:32:22.0905722Z moe/activation_test.py:117: 2025-05-07T20:32:22.0905906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0906015Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0906111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0906609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0906700Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0907063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0907282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0907627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0907716Z kernel = self.compile( 2025-05-07T20:32:22.0908140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0908355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0908481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0908486Z 2025-05-07T20:32:22.0908686Z self = 2025-05-07T20:32:22.0909455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0909947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16467a0>} 2025-05-07T20:32:22.0910697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0910927Z context = 2025-05-07T20:32:22.0910932Z 2025-05-07T20:32:22.0911097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0911358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0911459Z module_map=module_map) 2025-05-07T20:32:22.0911623Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0911717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0911791Z E ^ 2025-05-07T20:32:22.0912147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0912151Z 2025-05-07T20:32:22.0912565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0912570Z 2025-05-07T20:32:22.0912676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0912897Z self=, 2025-05-07T20:32:22.0912971Z T=2048, 2025-05-07T20:32:22.0913044Z D=7168, 2025-05-07T20:32:22.0913121Z scale_ub=1200.0, 2025-05-07T20:32:22.0913199Z contiguous=True, 2025-05-07T20:32:22.0913282Z compiled=False, 2025-05-07T20:32:22.0913349Z ) 2025-05-07T20:32:22.0913566Z self = 2025-05-07T20:32:22.0913735Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0913740Z 2025-05-07T20:32:22.0913817Z @given( 2025-05-07T20:32:22.0913937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0914035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0914144Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0914262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0914417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0914493Z ) 2025-05-07T20:32:22.0914738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0914829Z def test_silu_mul_quant( 2025-05-07T20:32:22.0914905Z self, 2025-05-07T20:32:22.0914980Z T: int, 2025-05-07T20:32:22.0915056Z D: int, 2025-05-07T20:32:22.0915153Z scale_ub: Optional[float], 2025-05-07T20:32:22.0915239Z contiguous: bool, 2025-05-07T20:32:22.0915321Z compiled: bool, 2025-05-07T20:32:22.0915401Z ) -> None: 2025-05-07T20:32:22.0915491Z torch.manual_seed(2025) 2025-05-07T20:32:22.0915562Z 2025-05-07T20:32:22.0915733Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0917554Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
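
For debugging, the CompilationError can likely be reproduced without Hypothesis by calling the op directly with one of the small logged examples (T=128, D=5120, scale_ub=None, contiguous=True, compiled=False); the import path below is taken from the traceback above, and the call shape mirrors the test body:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Values copied from one "Trying example" block in this log.
x = torch.randn([128, 2 * 5120], device="cuda", dtype=torch.bfloat16)
x0 = x[:, :5120].contiguous()
x1 = x[:, 5120:].contiguous()
y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # raises the fp8e4nv ValueError on SM < 8.9
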
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0917598Z 2025-05-07T20:32:22.0917726Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0917731Z 2025-05-07T20:32:22.0917830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0918048Z self=, 2025-05-07T20:32:22.0918127Z T=1, 2025-05-07T20:32:22.0918201Z D=5120, 2025-05-07T20:32:22.0918278Z scale_ub=1200.0, 2025-05-07T20:32:22.0918364Z contiguous=True, 2025-05-07T20:32:22.0918444Z compiled=False, 2025-05-07T20:32:22.0918517Z ) 2025-05-07T20:32:22.0918737Z self = 2025-05-07T20:32:22.0918940Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0918945Z 2025-05-07T20:32:22.0919023Z @given( 2025-05-07T20:32:22.0919138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0919231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0919342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0919453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0919561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0919637Z ) 2025-05-07T20:32:22.0919875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0919963Z def test_silu_mul_quant( 2025-05-07T20:32:22.0920037Z self, 2025-05-07T20:32:22.0920111Z T: int, 2025-05-07T20:32:22.0920184Z D: int, 2025-05-07T20:32:22.0920286Z scale_ub: Optional[float], 2025-05-07T20:32:22.0920376Z contiguous: bool, 2025-05-07T20:32:22.0920463Z compiled: bool, 2025-05-07T20:32:22.0920538Z ) -> None: 2025-05-07T20:32:22.0920630Z torch.manual_seed(2025) 2025-05-07T20:32:22.0920706Z 2025-05-07T20:32:22.0920872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0920943Z 2025-05-07T20:32:22.0921037Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0921157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0921244Z x = x_sign * x_clamp 2025-05-07T20:32:22.0921321Z x0 = x[:, :D] 2025-05-07T20:32:22.0921394Z x1 = x[:, D:] 2025-05-07T20:32:22.0921464Z 2025-05-07T20:32:22.0921545Z if contiguous: 2025-05-07T20:32:22.0921633Z x0 = x0.contiguous() 2025-05-07T20:32:22.0921723Z x1 = x1.contiguous() 2025-05-07T20:32:22.0921790Z 2025-05-07T20:32:22.0921877Z if scale_ub is not None: 2025-05-07T20:32:22.0922031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0922175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0922249Z ) 2025-05-07T20:32:22.0922322Z else: 2025-05-07T20:32:22.0922412Z scale_ub_tensor = None 2025-05-07T20:32:22.0922481Z 2025-05-07T20:32:22.0922609Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0922696Z op = silu_mul_quant 2025-05-07T20:32:22.0922775Z if compiled: 2025-05-07T20:32:22.0922874Z op = torch.compile(op) 2025-05-07T20:32:22.0922975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0923052Z 2025-05-07T20:32:22.0923138Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0923142Z 2025-05-07T20:32:22.0923235Z moe/activation_test.py:117: 2025-05-07T20:32:22.0923477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0923617Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0923776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0924285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0924376Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0924738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0924957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0925297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0925394Z kernel = self.compile( 2025-05-07T20:32:22.0925779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0925953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0926081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0926129Z 2025-05-07T20:32:22.0926331Z self = 2025-05-07T20:32:22.0927098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0927590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1647b00>} 2025-05-07T20:32:22.0928339Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0928526Z context = 2025-05-07T20:32:22.0928536Z 2025-05-07T20:32:22.0928699Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0928969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0929070Z module_map=module_map) 2025-05-07T20:32:22.0929229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0929325Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0929396Z E ^ 2025-05-07T20:32:22.0929752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0929756Z 2025-05-07T20:32:22.0930165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0930169Z 2025-05-07T20:32:22.0930267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0930530Z self=, 2025-05-07T20:32:22.0930612Z T=2048, 2025-05-07T20:32:22.0930688Z D=5120, 2025-05-07T20:32:22.0930765Z scale_ub=None, 2025-05-07T20:32:22.0930843Z contiguous=True, 2025-05-07T20:32:22.0930923Z compiled=False, 2025-05-07T20:32:22.0930991Z ) 2025-05-07T20:32:22.0931204Z self = 2025-05-07T20:32:22.0931378Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0931383Z 2025-05-07T20:32:22.0931453Z @given( 2025-05-07T20:32:22.0931568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0931664Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0931773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0931887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0931994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0932103Z ) 2025-05-07T20:32:22.0932352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0932483Z def test_silu_mul_quant( 2025-05-07T20:32:22.0932555Z self, 2025-05-07T20:32:22.0932628Z T: int, 2025-05-07T20:32:22.0932699Z D: int, 2025-05-07T20:32:22.0932793Z scale_ub: Optional[float], 2025-05-07T20:32:22.0932879Z contiguous: bool, 2025-05-07T20:32:22.0932961Z compiled: bool, 2025-05-07T20:32:22.0933032Z ) -> None: 2025-05-07T20:32:22.0933126Z torch.manual_seed(2025) 2025-05-07T20:32:22.0933195Z 2025-05-07T20:32:22.0933360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0933434Z 2025-05-07T20:32:22.0933520Z > x_sign = torch.sign(x) 2025-05-07T20:32:22.0935308Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0935358Z 2025-05-07T20:32:22.0935471Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:22.0935476Z 2025-05-07T20:32:22.0935576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0935794Z self=, 2025-05-07T20:32:22.0935867Z T=16384, 2025-05-07T20:32:22.0935943Z D=5120, 2025-05-07T20:32:22.0936018Z scale_ub=None, 2025-05-07T20:32:22.0936096Z contiguous=True, 2025-05-07T20:32:22.0936178Z compiled=False, 2025-05-07T20:32:22.0936246Z ) 2025-05-07T20:32:22.0936458Z self = 2025-05-07T20:32:22.0936642Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0936647Z 2025-05-07T20:32:22.0936721Z @given( 2025-05-07T20:32:22.0936838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0936930Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0937037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0937151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0937258Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0937327Z ) 2025-05-07T20:32:22.0937568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0937657Z def test_silu_mul_quant( 2025-05-07T20:32:22.0937731Z self, 2025-05-07T20:32:22.0937803Z T: int, 2025-05-07T20:32:22.0937874Z D: int, 2025-05-07T20:32:22.0937970Z scale_ub: Optional[float], 2025-05-07T20:32:22.0938053Z contiguous: bool, 2025-05-07T20:32:22.0938180Z compiled: bool, 2025-05-07T20:32:22.0938256Z ) -> None: 2025-05-07T20:32:22.0938345Z torch.manual_seed(2025) 2025-05-07T20:32:22.0938601Z 2025-05-07T20:32:22.0938840Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0940656Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0940663Z 2025-05-07T20:32:22.0940868Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0941011Z 2025-05-07T20:32:22.0941114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0941336Z self=, 2025-05-07T20:32:22.0941409Z T=4096, 2025-05-07T20:32:22.0941481Z D=5120, 2025-05-07T20:32:22.0941564Z scale_ub=None, 2025-05-07T20:32:22.0941641Z contiguous=True, 2025-05-07T20:32:22.0941720Z compiled=False, 2025-05-07T20:32:22.0941789Z ) 2025-05-07T20:32:22.0941999Z self = 2025-05-07T20:32:22.0942164Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0942168Z 2025-05-07T20:32:22.0942244Z @given( 2025-05-07T20:32:22.0942354Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0942446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0942561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0942675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0942855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0942924Z ) 2025-05-07T20:32:22.0943163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0943256Z def test_silu_mul_quant( 2025-05-07T20:32:22.0943328Z self, 2025-05-07T20:32:22.0943397Z T: int, 2025-05-07T20:32:22.0943475Z D: int, 2025-05-07T20:32:22.0943567Z scale_ub: Optional[float], 2025-05-07T20:32:22.0943654Z contiguous: bool, 2025-05-07T20:32:22.0943738Z compiled: bool, 2025-05-07T20:32:22.0943811Z ) -> None: 2025-05-07T20:32:22.0943900Z torch.manual_seed(2025) 2025-05-07T20:32:22.0943975Z 2025-05-07T20:32:22.0944138Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0945912Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
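
The "Tried to allocate" sizes line up exactly with the bfloat16 input that torch.randn builds at moe/activation_test.py:92: T x 2D elements at two bytes each. A quick check against three examples from this log:

def randn_bytes(T: int, D: int) -> int:
    # bfloat16 input of shape [T, 2 * D]: two bytes per element
    return T * (2 * D) * 2

assert randn_bytes(2048, 5120) == 40 * 1024**2    # "Tried to allocate 40.00 MiB"
assert randn_bytes(4096, 5120) == 80 * 1024**2    # "Tried to allocate 80.00 MiB"
assert randn_bytes(16384, 5120) == 320 * 1024**2  # "Tried to allocate 320.00 MiB"

So no single tensor here is large relative to the 22.07 GiB card; even 40 MiB requests fail only because roughly 21.7 GiB is already held when the example starts.
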
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0945923Z 2025-05-07T20:32:22.0946035Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0946040Z 2025-05-07T20:32:22.0946137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0946353Z self=, 2025-05-07T20:32:22.0946428Z T=2048, 2025-05-07T20:32:22.0946503Z D=5120, 2025-05-07T20:32:22.0946580Z scale_ub=None, 2025-05-07T20:32:22.0946663Z contiguous=False, 2025-05-07T20:32:22.0946747Z compiled=False, 2025-05-07T20:32:22.0946816Z ) 2025-05-07T20:32:22.0947091Z self = 2025-05-07T20:32:22.0947269Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0947274Z 2025-05-07T20:32:22.0947346Z @given( 2025-05-07T20:32:22.0947461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0947557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0947665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0947777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0947884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0947953Z ) 2025-05-07T20:32:22.0948193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0948281Z def test_silu_mul_quant( 2025-05-07T20:32:22.0948356Z self, 2025-05-07T20:32:22.0948428Z T: int, 2025-05-07T20:32:22.0948543Z D: int, 2025-05-07T20:32:22.0948675Z scale_ub: Optional[float], 2025-05-07T20:32:22.0948761Z contiguous: bool, 2025-05-07T20:32:22.0948840Z compiled: bool, 2025-05-07T20:32:22.0948915Z ) -> None: 2025-05-07T20:32:22.0949006Z torch.manual_seed(2025) 2025-05-07T20:32:22.0949074Z 2025-05-07T20:32:22.0949243Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0951013Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0951019Z 2025-05-07T20:32:22.0951142Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0951187Z 2025-05-07T20:32:22.0951285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0951503Z self=, 2025-05-07T20:32:22.0951576Z T=4096, 2025-05-07T20:32:22.0951648Z D=7168, 2025-05-07T20:32:22.0951732Z scale_ub=None, 2025-05-07T20:32:22.0951815Z contiguous=True, 2025-05-07T20:32:22.0951890Z compiled=True, 2025-05-07T20:32:22.0951962Z ) 2025-05-07T20:32:22.0952171Z self = 2025-05-07T20:32:22.0952335Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0952339Z 2025-05-07T20:32:22.0952415Z @given( 2025-05-07T20:32:22.0952527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0952628Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0952740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0952857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0952965Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0953034Z ) 2025-05-07T20:32:22.0953271Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0953359Z def test_silu_mul_quant( 2025-05-07T20:32:22.0953435Z self, 2025-05-07T20:32:22.0953507Z T: int, 2025-05-07T20:32:22.0953582Z D: int, 2025-05-07T20:32:22.0953675Z scale_ub: Optional[float], 2025-05-07T20:32:22.0953757Z contiguous: bool, 2025-05-07T20:32:22.0953838Z compiled: bool, 2025-05-07T20:32:22.0953911Z ) -> None: 2025-05-07T20:32:22.0954000Z torch.manual_seed(2025) 2025-05-07T20:32:22.0954070Z 2025-05-07T20:32:22.0954234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0956055Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0956070Z 2025-05-07T20:32:22.0956180Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0956185Z 2025-05-07T20:32:22.0956282Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0956502Z self=, 2025-05-07T20:32:22.0956576Z T=2048, 2025-05-07T20:32:22.0956650Z D=5120, 2025-05-07T20:32:22.0956727Z scale_ub=1200.0, 2025-05-07T20:32:22.0956880Z contiguous=False, 2025-05-07T20:32:22.0957007Z compiled=False, 2025-05-07T20:32:22.0957074Z ) 2025-05-07T20:32:22.0957285Z self = 2025-05-07T20:32:22.0957460Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0957465Z 2025-05-07T20:32:22.0957537Z @given( 2025-05-07T20:32:22.0957646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0957739Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0957846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0957958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0958064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0958134Z ) 2025-05-07T20:32:22.0958375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0958462Z def test_silu_mul_quant( 2025-05-07T20:32:22.0958535Z self, 2025-05-07T20:32:22.0958609Z T: int, 2025-05-07T20:32:22.0958730Z D: int, 2025-05-07T20:32:22.0958822Z scale_ub: Optional[float], 2025-05-07T20:32:22.0958906Z contiguous: bool, 2025-05-07T20:32:22.0958987Z compiled: bool, 2025-05-07T20:32:22.0959060Z ) -> None: 2025-05-07T20:32:22.0959151Z torch.manual_seed(2025) 2025-05-07T20:32:22.0959217Z 2025-05-07T20:32:22.0959381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0961150Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0961161Z 2025-05-07T20:32:22.0961278Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0961283Z 2025-05-07T20:32:22.0961378Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0961593Z self=, 2025-05-07T20:32:22.0961671Z T=4096, 2025-05-07T20:32:22.0961742Z D=7168, 2025-05-07T20:32:22.0961816Z scale_ub=1200.0, 2025-05-07T20:32:22.0961897Z contiguous=True, 2025-05-07T20:32:22.0961972Z compiled=False, 2025-05-07T20:32:22.0962041Z ) 2025-05-07T20:32:22.0962257Z self = 2025-05-07T20:32:22.0962422Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0962427Z 2025-05-07T20:32:22.0962502Z @given( 2025-05-07T20:32:22.0962614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0962752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0962868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0962976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0963081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0963156Z ) 2025-05-07T20:32:22.0963474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0963560Z def test_silu_mul_quant( 2025-05-07T20:32:22.0963634Z self, 2025-05-07T20:32:22.0963706Z T: int, 2025-05-07T20:32:22.0963778Z D: int, 2025-05-07T20:32:22.0963870Z scale_ub: Optional[float], 2025-05-07T20:32:22.0963952Z contiguous: bool, 2025-05-07T20:32:22.0964036Z compiled: bool, 2025-05-07T20:32:22.0964108Z ) -> None: 2025-05-07T20:32:22.0964196Z torch.manual_seed(2025) 2025-05-07T20:32:22.0964268Z 2025-05-07T20:32:22.0964476Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0966292Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0966303Z 2025-05-07T20:32:22.0966414Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0966419Z 2025-05-07T20:32:22.0966515Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0966733Z self=, 2025-05-07T20:32:22.0966807Z T=16384, 2025-05-07T20:32:22.0966884Z D=7168, 2025-05-07T20:32:22.0966961Z scale_ub=None, 2025-05-07T20:32:22.0967087Z contiguous=False, 2025-05-07T20:32:22.0967167Z compiled=True, 2025-05-07T20:32:22.0967238Z ) 2025-05-07T20:32:22.0967447Z self = 2025-05-07T20:32:22.0967621Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:22.0967626Z 2025-05-07T20:32:22.0967699Z @given( 2025-05-07T20:32:22.0967811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0967910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0968016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0968127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0968234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0968304Z ) 2025-05-07T20:32:22.0968547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0968635Z def test_silu_mul_quant( 2025-05-07T20:32:22.0968713Z self, 2025-05-07T20:32:22.0968790Z T: int, 2025-05-07T20:32:22.0968860Z D: int, 2025-05-07T20:32:22.0968951Z scale_ub: Optional[float], 2025-05-07T20:32:22.0969036Z contiguous: bool, 2025-05-07T20:32:22.0969116Z compiled: bool, 2025-05-07T20:32:22.0969188Z ) -> None: 2025-05-07T20:32:22.0969280Z torch.manual_seed(2025) 2025-05-07T20:32:22.0969349Z 2025-05-07T20:32:22.0969512Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0971332Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
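
Every example after the first OOM now fails at the very first allocation (activation_test.py:92), regardless of the contiguous and compiled parameters, with ~21.7 GiB still allocated. That is consistent with tensors from earlier examples never being released inside the long-lived Hypothesis process. A per-example cleanup along these lines (hypothetical, not present in the test file) would keep one failing example from starving the rest, e.g. called at the top of test_silu_mul_quant or from a fixture:

import gc
import torch

def release_cuda_memory() -> None:
    # Drop dead Python references first, then hand cached blocks back
    # to the driver so the next example starts from a clean pool.
    gc.collect()
    torch.cuda.empty_cache()
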
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0971343Z 2025-05-07T20:32:22.0971458Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0971463Z 2025-05-07T20:32:22.0971559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0971776Z self=, 2025-05-07T20:32:22.0971854Z T=4096, 2025-05-07T20:32:22.0971925Z D=7168, 2025-05-07T20:32:22.0971999Z scale_ub=None, 2025-05-07T20:32:22.0972080Z contiguous=True, 2025-05-07T20:32:22.0972156Z compiled=False, 2025-05-07T20:32:22.0972227Z ) 2025-05-07T20:32:22.0972439Z self = 2025-05-07T20:32:22.0972602Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0972606Z 2025-05-07T20:32:22.0972721Z @given( 2025-05-07T20:32:22.0972835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0972967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0973076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0973185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0973292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0973364Z ) 2025-05-07T20:32:22.0973601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0973689Z def test_silu_mul_quant( 2025-05-07T20:32:22.0973764Z self, 2025-05-07T20:32:22.0973833Z T: int, 2025-05-07T20:32:22.0973905Z D: int, 2025-05-07T20:32:22.0973996Z scale_ub: Optional[float], 2025-05-07T20:32:22.0974077Z contiguous: bool, 2025-05-07T20:32:22.0974161Z compiled: bool, 2025-05-07T20:32:22.0974233Z ) -> None: 2025-05-07T20:32:22.0974321Z torch.manual_seed(2025) 2025-05-07T20:32:22.0974396Z 2025-05-07T20:32:22.0974562Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0976378Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
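For readers unfamiliar with the "Trying example:" lines: Hypothesis is re-running the @given-decorated test across the sampled parameter grid, printing each example because the test runs with verbosity=Verbosity.verbose. The session banner later in this log reports a 'ci' profile (database=None, deadline=None, print_blob=True, derandomize=True). A profile like that would be registered roughly as below; the actual definition lives in the test suite's configuration and is not shown in this log, so treat this as a sketch:

    from hypothesis import HealthCheck, settings

    # Hypothetical registration matching the profile banner printed by pytest.
    settings.register_profile(
        "ci",
        database=None,     # no example database on CI
        deadline=None,     # GPU compile times would trip any per-example deadline
        print_blob=True,
        derandomize=True,  # deterministic example order across runs
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")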
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0976389Z 2025-05-07T20:32:22.0976499Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0976503Z 2025-05-07T20:32:22.0976601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0976823Z self=, 2025-05-07T20:32:22.0976898Z T=16384, 2025-05-07T20:32:22.0976980Z D=7168, 2025-05-07T20:32:22.0977059Z scale_ub=None, 2025-05-07T20:32:22.0977137Z contiguous=True, 2025-05-07T20:32:22.0977217Z compiled=False, 2025-05-07T20:32:22.0977286Z ) 2025-05-07T20:32:22.0977494Z self = 2025-05-07T20:32:22.0977666Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0977670Z 2025-05-07T20:32:22.0977743Z @given( 2025-05-07T20:32:22.0977857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0977955Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0978062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0978174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0978285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0978355Z ) 2025-05-07T20:32:22.0978643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0978738Z def test_silu_mul_quant( 2025-05-07T20:32:22.0978811Z self, 2025-05-07T20:32:22.0978889Z T: int, 2025-05-07T20:32:22.0978961Z D: int, 2025-05-07T20:32:22.0979055Z scale_ub: Optional[float], 2025-05-07T20:32:22.0979140Z contiguous: bool, 2025-05-07T20:32:22.0979217Z compiled: bool, 2025-05-07T20:32:22.0979290Z ) -> None: 2025-05-07T20:32:22.0979381Z torch.manual_seed(2025) 2025-05-07T20:32:22.0979450Z 2025-05-07T20:32:22.0979614Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0981486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0981530Z 2025-05-07T20:32:22.0981643Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0981647Z 2025-05-07T20:32:22.0981743Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0981959Z self=, 2025-05-07T20:32:22.0982034Z T=16384, 2025-05-07T20:32:22.0982108Z D=7168, 2025-05-07T20:32:22.0982187Z scale_ub=1200.0, 2025-05-07T20:32:22.0982271Z contiguous=True, 2025-05-07T20:32:22.0982352Z compiled=False, 2025-05-07T20:32:22.0982420Z ) 2025-05-07T20:32:22.0982634Z self = 2025-05-07T20:32:22.0982805Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0982810Z 2025-05-07T20:32:22.0982891Z @given( 2025-05-07T20:32:22.0983044Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0983137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0983246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0983355Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0983465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0983537Z ) 2025-05-07T20:32:22.0983775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0983863Z def test_silu_mul_quant( 2025-05-07T20:32:22.0983938Z self, 2025-05-07T20:32:22.0984011Z T: int, 2025-05-07T20:32:22.0984084Z D: int, 2025-05-07T20:32:22.0984180Z scale_ub: Optional[float], 2025-05-07T20:32:22.0984263Z contiguous: bool, 2025-05-07T20:32:22.0984345Z compiled: bool, 2025-05-07T20:32:22.0984420Z ) -> None: 2025-05-07T20:32:22.0984513Z torch.manual_seed(2025) 2025-05-07T20:32:22.0984592Z 2025-05-07T20:32:22.0984756Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0986526Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
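Each of these failures ends with the allocator's own suggestion. PYTORCH_CUDA_ALLOC_CONF is read when CUDA is first initialized, so it has to be set before that point, either in the shell that launches pytest or at the very top of the process. A minimal sketch, assuming the variable is not already set by the CI job:

    import os

    # Must happen before torch initializes CUDA (ideally before importing torch).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402

    if torch.cuda.is_available():
        x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Note that the messages also show 21.73 GiB genuinely allocated by PyTorch with only ~19 MiB reserved-but-unallocated, so fragmentation is at most a small part of the problem here.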
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0986537Z 2025-05-07T20:32:22.0986648Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0986652Z 2025-05-07T20:32:22.0986751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0987040Z self=, 2025-05-07T20:32:22.0987116Z T=128, 2025-05-07T20:32:22.0987193Z D=5120, 2025-05-07T20:32:22.0987269Z scale_ub=1200.0, 2025-05-07T20:32:22.0987348Z contiguous=False, 2025-05-07T20:32:22.0987428Z compiled=False, 2025-05-07T20:32:22.0987496Z ) 2025-05-07T20:32:22.0987706Z self = 2025-05-07T20:32:22.0987877Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0987881Z 2025-05-07T20:32:22.0987949Z @given( 2025-05-07T20:32:22.0988059Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0988154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0988259Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0988377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0988526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0988598Z ) 2025-05-07T20:32:22.0988880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0988971Z def test_silu_mul_quant( 2025-05-07T20:32:22.0989045Z self, 2025-05-07T20:32:22.0989118Z T: int, 2025-05-07T20:32:22.0989189Z D: int, 2025-05-07T20:32:22.0989279Z scale_ub: Optional[float], 2025-05-07T20:32:22.0989365Z contiguous: bool, 2025-05-07T20:32:22.0989444Z compiled: bool, 2025-05-07T20:32:22.0989515Z ) -> None: 2025-05-07T20:32:22.0989609Z torch.manual_seed(2025) 2025-05-07T20:32:22.0989673Z 2025-05-07T20:32:22.0989837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0989910Z 2025-05-07T20:32:22.0989997Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0990120Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0990203Z x = x_sign * x_clamp 2025-05-07T20:32:22.0990283Z x0 = x[:, :D] 2025-05-07T20:32:22.0990364Z x1 = x[:, D:] 2025-05-07T20:32:22.0990476Z 2025-05-07T20:32:22.0990554Z if contiguous: 2025-05-07T20:32:22.0990642Z x0 = x0.contiguous() 2025-05-07T20:32:22.0990728Z x1 = x1.contiguous() 2025-05-07T20:32:22.0990797Z 2025-05-07T20:32:22.0990887Z if scale_ub is not None: 2025-05-07T20:32:22.0990987Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0991141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0991218Z ) 2025-05-07T20:32:22.0991310Z else: 2025-05-07T20:32:22.0991404Z scale_ub_tensor = None 2025-05-07T20:32:22.0991472Z 2025-05-07T20:32:22.0991596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0991682Z op = silu_mul_quant 2025-05-07T20:32:22.0991761Z if compiled: 2025-05-07T20:32:22.0991852Z op = torch.compile(op) 2025-05-07T20:32:22.0991957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0992028Z 2025-05-07T20:32:22.0992117Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0992121Z 2025-05-07T20:32:22.0992215Z moe/activation_test.py:117: 2025-05-07T20:32:22.0992340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0992438Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0992534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0993032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0993123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0993482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:22.0993699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0994091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0994185Z kernel = self.compile( 2025-05-07T20:32:22.0994570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0994740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0994861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0994866Z 2025-05-07T20:32:22.0995072Z self = 2025-05-07T20:32:22.0995841Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0996375Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146e700>} 2025-05-07T20:32:22.0997168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0997354Z context = 2025-05-07T20:32:22.0997362Z 2025-05-07T20:32:22.0997525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0997789Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0997892Z module_map=module_map) 2025-05-07T20:32:22.0998047Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0998139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0998213Z E ^ 2025-05-07T20:32:22.0998568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0998618Z 2025-05-07T20:32:22.0999037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0999042Z 2025-05-07T20:32:22.0999142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0999362Z self=, 2025-05-07T20:32:22.0999434Z T=2048, 2025-05-07T20:32:22.0999502Z D=7168, 2025-05-07T20:32:22.0999577Z scale_ub=None, 2025-05-07T20:32:22.0999660Z contiguous=False, 2025-05-07T20:32:22.0999742Z compiled=False, 2025-05-07T20:32:22.0999811Z ) 2025-05-07T20:32:22.1000028Z self = 2025-05-07T20:32:22.1000197Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.1000202Z 2025-05-07T20:32:22.1000279Z @given( 2025-05-07T20:32:22.1000395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1000492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1000607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1000716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1000825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1000897Z ) 2025-05-07T20:32:22.1001136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1001223Z def test_silu_mul_quant( 2025-05-07T20:32:22.1001294Z self, 2025-05-07T20:32:22.1001364Z T: int, 2025-05-07T20:32:22.1001439Z D: int, 2025-05-07T20:32:22.1001534Z scale_ub: Optional[float], 2025-05-07T20:32:22.1001616Z contiguous: bool, 2025-05-07T20:32:22.1001706Z compiled: bool, 2025-05-07T20:32:22.1001780Z ) -> None: 2025-05-07T20:32:22.1001869Z torch.manual_seed(2025) 2025-05-07T20:32:22.1001939Z 2025-05-07T20:32:22.1002152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1004054Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
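The CompilationError above, unlike the OOMs, is an architecture limitation: Triton's fp8e4nv type (float8 e4m3) is rejected on this runner's GPU, and only fp8e4b15 and fp8e5 are offered. A g5 instance carries an A10G, i.e. compute capability 8.6, while fp8e4nv is generally understood to need capability 8.9 or newer (Ada/Hopper); that threshold is an assumption here, not something this log states. A hedged capability guard that would turn these failures into skips might look like:

    import pytest
    import torch

    def require_fp8e4nv() -> None:
        # Assumption: Triton's fp8e4nv needs compute capability >= 8.9.
        # On sm_86 the error lists only ('fp8e4b15', 'fp8e5') as supported.
        if not torch.cuda.is_available():
            pytest.skip("CUDA not available")
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) < (8, 9):
            pytest.skip(f"fp8e4nv unsupported on sm_{major}{minor}")

Calling require_fp8e4nv() at the top of test_silu_mul_quant (a hypothetical change, not what the suite does) would report the incompatibility as a skip instead of a red test.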
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1004060Z 2025-05-07T20:32:22.1004172Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.1004177Z 2025-05-07T20:32:22.1004275Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1004495Z self=, 2025-05-07T20:32:22.1004611Z T=128, 2025-05-07T20:32:22.1004687Z D=7168, 2025-05-07T20:32:22.1004805Z scale_ub=1200.0, 2025-05-07T20:32:22.1004883Z contiguous=True, 2025-05-07T20:32:22.1004963Z compiled=True, 2025-05-07T20:32:22.1005029Z ) 2025-05-07T20:32:22.1005240Z self = 2025-05-07T20:32:22.1005406Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.1005410Z 2025-05-07T20:32:22.1005481Z @given( 2025-05-07T20:32:22.1005592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1005686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1005792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1005903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1006010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1006081Z ) 2025-05-07T20:32:22.1006327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1006419Z def test_silu_mul_quant( 2025-05-07T20:32:22.1006535Z self, 2025-05-07T20:32:22.1006611Z T: int, 2025-05-07T20:32:22.1006684Z D: int, 2025-05-07T20:32:22.1006777Z scale_ub: Optional[float], 2025-05-07T20:32:22.1006862Z contiguous: bool, 2025-05-07T20:32:22.1006943Z compiled: bool, 2025-05-07T20:32:22.1007016Z ) -> None: 2025-05-07T20:32:22.1007109Z torch.manual_seed(2025) 2025-05-07T20:32:22.1007177Z 2025-05-07T20:32:22.1007344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1007414Z 2025-05-07T20:32:22.1007501Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1007625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1007709Z x = x_sign * x_clamp 2025-05-07T20:32:22.1007787Z x0 = x[:, :D] 2025-05-07T20:32:22.1007863Z x1 = x[:, D:] 2025-05-07T20:32:22.1007928Z 2025-05-07T20:32:22.1008009Z if contiguous: 2025-05-07T20:32:22.1008106Z x0 = x0.contiguous() 2025-05-07T20:32:22.1008193Z x1 = x1.contiguous() 2025-05-07T20:32:22.1008263Z 2025-05-07T20:32:22.1008352Z if scale_ub is not None: 2025-05-07T20:32:22.1008452Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.1008583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.1008655Z ) 2025-05-07T20:32:22.1008727Z else: 2025-05-07T20:32:22.1008819Z scale_ub_tensor = None 2025-05-07T20:32:22.1008886Z 2025-05-07T20:32:22.1009010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.1009098Z op = silu_mul_quant 2025-05-07T20:32:22.1009177Z if compiled: 2025-05-07T20:32:22.1009270Z op = torch.compile(op) 2025-05-07T20:32:22.1009374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1009443Z 2025-05-07T20:32:22.1009529Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.1009533Z 2025-05-07T20:32:22.1009678Z moe/activation_test.py:117: 2025-05-07T20:32:22.1009807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.1009907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.1013167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1013569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.1013664Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.1014167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.1014261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.1014620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.1014910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.1015256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.1015390Z kernel = self.compile( 2025-05-07T20:32:22.1015778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.1015949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.1016076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.1016081Z 2025-05-07T20:32:22.1016284Z self = 2025-05-07T20:32:22.1017052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.1017556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146ff60>} 2025-05-07T20:32:22.1018372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.1018561Z context = 2025-05-07T20:32:22.1018566Z 2025-05-07T20:32:22.1018727Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.1018994Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.1019095Z module_map=module_map) 2025-05-07T20:32:22.1019255Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.1019356Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.1019428Z E ^ 2025-05-07T20:32:22.1019786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.1019795Z 2025-05-07T20:32:22.1020215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.1020219Z 2025-05-07T20:32:22.1020319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1020541Z self=, 2025-05-07T20:32:22.1020612Z T=128, 2025-05-07T20:32:22.1020687Z D=7168, 2025-05-07T20:32:22.1020770Z scale_ub=1200.0, 2025-05-07T20:32:22.1020850Z contiguous=True, 2025-05-07T20:32:22.1020930Z compiled=False, 2025-05-07T20:32:22.1021006Z ) 2025-05-07T20:32:22.1021218Z self = 2025-05-07T20:32:22.1021387Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.1021395Z 2025-05-07T20:32:22.1021473Z @given( 2025-05-07T20:32:22.1021636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1021737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1021846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1021958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1022069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1022142Z ) 2025-05-07T20:32:22.1022382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1022473Z def test_silu_mul_quant( 2025-05-07T20:32:22.1022547Z self, 2025-05-07T20:32:22.1022619Z T: int, 2025-05-07T20:32:22.1022697Z D: int, 2025-05-07T20:32:22.1022789Z scale_ub: Optional[float], 2025-05-07T20:32:22.1022876Z contiguous: bool, 2025-05-07T20:32:22.1022956Z compiled: bool, 2025-05-07T20:32:22.1023031Z ) -> None: 2025-05-07T20:32:22.1023169Z torch.manual_seed(2025) 2025-05-07T20:32:22.1023237Z 2025-05-07T20:32:22.1023447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1023524Z 2025-05-07T20:32:22.1023614Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1023735Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1025509Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
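Worth noticing in the trace above: the OOM has moved from the initial torch.randn to the later torch.clamp, and the "allocated by PyTorch" figure has crept from 21.73 GiB to 21.77 GiB over successive examples, so memory from earlier Hypothesis examples is still alive when the next one starts. One mitigation (hypothetical placement; the test as shown does not do this) is to drop references and return cached blocks between examples:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Collect dead Python references first, then hand cached, unused
        # allocator blocks back to the driver. Safe no-op without CUDA.
        gc.collect()
        torch.cuda.empty_cache()

Calling this at the start of each example trades speed for headroom; it does not help if live tensors are genuinely pinned by earlier state.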
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1025515Z 2025-05-07T20:32:22.1025630Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.1025635Z 2025-05-07T20:32:22.1025742Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1026004Z self=, 2025-05-07T20:32:22.1026082Z T=128, 2025-05-07T20:32:22.1026156Z D=5120, 2025-05-07T20:32:22.1026235Z scale_ub=1200.0, 2025-05-07T20:32:22.1026320Z contiguous=True, 2025-05-07T20:32:22.1026397Z compiled=True, 2025-05-07T20:32:22.1026467Z ) 2025-05-07T20:32:22.1026684Z self = 2025-05-07T20:32:22.1026850Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.1026854Z 2025-05-07T20:32:22.1026927Z @given( 2025-05-07T20:32:22.1027043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1027136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1027246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1027358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1027473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1027545Z ) 2025-05-07T20:32:22.1027784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1027872Z def test_silu_mul_quant( 2025-05-07T20:32:22.1027949Z self, 2025-05-07T20:32:22.1028023Z T: int, 2025-05-07T20:32:22.1028097Z D: int, 2025-05-07T20:32:22.1028192Z scale_ub: Optional[float], 2025-05-07T20:32:22.1028278Z contiguous: bool, 2025-05-07T20:32:22.1028359Z compiled: bool, 2025-05-07T20:32:22.1028436Z ) -> None: 2025-05-07T20:32:22.1028527Z torch.manual_seed(2025) 2025-05-07T20:32:22.1028596Z 2025-05-07T20:32:22.1028762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1028831Z 2025-05-07T20:32:22.1028921Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1029044Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1030856Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
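Later in this log the reference path fails the same way: triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which also requires fp8e4nv. The row-wise quantization itself is simple to state in plain PyTorch, which is useful as a mental model for what the Triton kernel computes. A sketch under stated assumptions (448.0 is the finite maximum of float8_e4m3fn; torch.float8_e4m3fn needs a recent PyTorch; this is not FBGEMM's implementation):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale so each row's max magnitude maps to FP8_E4M3_MAX.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

Dequantization then matches the check in the test body, y_fp8.to(torch.float32) * y_scale[:, None].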
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1030869Z 2025-05-07T20:32:22.1030983Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.1030988Z 2025-05-07T20:32:22.1031086Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1031310Z self=, 2025-05-07T20:32:22.1031384Z T=128, 2025-05-07T20:32:22.1031492Z D=7168, 2025-05-07T20:32:22.1031579Z scale_ub=None, 2025-05-07T20:32:22.1031704Z contiguous=True, 2025-05-07T20:32:22.1031790Z compiled=True, 2025-05-07T20:32:22.1031862Z ) 2025-05-07T20:32:22.1032074Z self = 2025-05-07T20:32:22.1032240Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.1032245Z 2025-05-07T20:32:22.1032317Z @given( 2025-05-07T20:32:22.1032431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1032526Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1032634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1032746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1032857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1032929Z ) 2025-05-07T20:32:22.1033170Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1033261Z def test_silu_mul_quant( 2025-05-07T20:32:22.1033338Z self, 2025-05-07T20:32:22.1033461Z T: int, 2025-05-07T20:32:22.1033533Z D: int, 2025-05-07T20:32:22.1033626Z scale_ub: Optional[float], 2025-05-07T20:32:22.1033715Z contiguous: bool, 2025-05-07T20:32:22.1033799Z compiled: bool, 2025-05-07T20:32:22.1033872Z ) -> None: 2025-05-07T20:32:22.1033963Z torch.manual_seed(2025) 2025-05-07T20:32:22.1034029Z 2025-05-07T20:32:22.1034194Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1035962Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1035973Z 2025-05-07T20:32:22.1036086Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.1036217Z =============================== warnings summary =============================== 2025-05-07T20:32:22.1036520Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:22.1036824Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:22.1037118Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:22.1037985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:22.1038260Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:22.1038268Z 2025-05-07T20:32:22.1038761Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:22.1038995Z ================= 1 failed, 1 deselected, 3 warnings in 14.16s ================= 2025-05-07T20:32:23.7176622Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:23.7827812Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:23.7828159Z 2025-05-07T20:32:25.7848617Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:27.9536093Z ============================= test session starts ============================== 2025-05-07T20:32:27.9536834Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:27.9537354Z cachedir: .pytest_cache 2025-05-07T20:32:27.9537924Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:27.9539012Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:27.9539426Z plugins: hypothesis-6.131.14 2025-05-07T20:32:29.5749891Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:29.7265700Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:29.7266510Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:29.7266939Z 2025-05-07T20:32:32.1390891Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.1392232Z self=, 2025-05-07T20:32:32.1393461Z T=1, 2025-05-07T20:32:32.1393826Z D=5120, 2025-05-07T20:32:32.1394203Z scale_ub=None, 2025-05-07T20:32:32.1394503Z contiguous=True, 2025-05-07T20:32:32.1394724Z compiled=True, 2025-05-07T20:32:32.1394928Z ) 2025-05-07T20:32:32.1395253Z self = 2025-05-07T20:32:32.1395747Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.1396014Z 2025-05-07T20:32:32.1396093Z @given( 2025-05-07T20:32:32.1396326Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.1396644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.1396948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.1397284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.1397619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.1397898Z ) 2025-05-07T20:32:32.1398258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.1398707Z def test_silu_mul_quant( 2025-05-07T20:32:32.1398955Z self, 2025-05-07T20:32:32.1399145Z T: int, 2025-05-07T20:32:32.1399349Z D: int, 2025-05-07T20:32:32.1399572Z scale_ub: Optional[float], 2025-05-07T20:32:32.1399842Z contiguous: bool, 2025-05-07T20:32:32.1400082Z compiled: bool, 2025-05-07T20:32:32.1400311Z ) -> None: 2025-05-07T20:32:32.1400523Z torch.manual_seed(2025) 2025-05-07T20:32:32.1400766Z 2025-05-07T20:32:32.1401046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.1401384Z 2025-05-07T20:32:32.1401585Z x_sign = torch.sign(x) 2025-05-07T20:32:32.1401879Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:32.1402186Z x = x_sign * x_clamp 2025-05-07T20:32:32.1402432Z x0 = x[:, :D] 2025-05-07T20:32:32.1402652Z x1 = x[:, D:] 2025-05-07T20:32:32.1402956Z 2025-05-07T20:32:32.1403149Z if contiguous: 2025-05-07T20:32:32.1403388Z x0 = x0.contiguous() 2025-05-07T20:32:32.1403769Z x1 = x1.contiguous() 2025-05-07T20:32:32.1404024Z 2025-05-07T20:32:32.1404221Z if scale_ub is not None: 2025-05-07T20:32:32.1404498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.1404831Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.1405147Z ) 2025-05-07T20:32:32.1405341Z else: 2025-05-07T20:32:32.1405547Z scale_ub_tensor = None 2025-05-07T20:32:32.1405805Z 2025-05-07T20:32:32.1406044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1406354Z op = silu_mul_quant 2025-05-07T20:32:32.1406610Z if compiled: 2025-05-07T20:32:32.1406861Z op = torch.compile(op) 2025-05-07T20:32:32.1407246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.1407620Z 2025-05-07T20:32:32.1407818Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.1408100Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.1408394Z 2025-05-07T20:32:32.1408636Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1408972Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.1409270Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.1409585Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.1409947Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1410255Z 2025-05-07T20:32:32.1410460Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.1410653Z 2025-05-07T20:32:32.1410763Z moe/activation_test.py:126: 2025-05-07T20:32:32.1411055Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1411398Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.1411732Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1412583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.1413345Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.1413895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.1414585Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.1415273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.1416005Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1416767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:32.1417523Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1418258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.1418903Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.1419525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.1420053Z fn() 2025-05-07T20:32:32.1420561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.1421153Z self.fn.run( 
2025-05-07T20:32:32.1421624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.1422155Z kernel = self.compile( 2025-05-07T20:32:32.1422708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.1423425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1423830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1424059Z 2025-05-07T20:32:32.1424268Z self = 2025-05-07T20:32:32.1425353Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.1426744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3ab33d3a0>} 2025-05-07T20:32:32.1428137Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.1429209Z context = 2025-05-07T20:32:32.1429502Z 2025-05-07T20:32:32.1429671Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.1430198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1430676Z module_map=module_map) 2025-05-07T20:32:32.1431045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1431405Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.1431674Z E ^ 2025-05-07T20:32:32.1432133Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1432592Z 2025-05-07T20:32:32.1433014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.1433540Z 2025-05-07T20:32:32.1433651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.1434109Z self=, 2025-05-07T20:32:32.1434505Z T=2048, 2025-05-07T20:32:32.1434696Z D=5120, 2025-05-07T20:32:32.1434886Z scale_ub=1200.0, 2025-05-07T20:32:32.1435104Z contiguous=True, 2025-05-07T20:32:32.1435327Z compiled=False, 2025-05-07T20:32:32.1435541Z ) 2025-05-07T20:32:33.0991594Z self = 2025-05-07T20:32:33.0993240Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:33.0994009Z 2025-05-07T20:32:33.0994248Z @given( 2025-05-07T20:32:33.0994633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.0994995Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.0995308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.0995658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.0996009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.0996304Z ) 2025-05-07T20:32:33.0996662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.0997105Z def test_silu_mul_quant( 2025-05-07T20:32:33.0997350Z self, 2025-05-07T20:32:33.0997551Z T: int, 2025-05-07T20:32:33.0997744Z D: int, 2025-05-07T20:32:33.0997968Z scale_ub: Optional[float], 2025-05-07T20:32:33.0998246Z contiguous: bool, 2025-05-07T20:32:33.0998484Z compiled: bool, 2025-05-07T20:32:33.0998719Z ) -> None: 2025-05-07T20:32:33.0998936Z torch.manual_seed(2025) 2025-05-07T20:32:33.0999176Z 2025-05-07T20:32:33.0999458Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.0999809Z 
2025-05-07T20:32:33.1000000Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1000303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1000939Z x = x_sign * x_clamp 2025-05-07T20:32:33.1001200Z x0 = x[:, :D] 2025-05-07T20:32:33.1001417Z x1 = x[:, D:] 2025-05-07T20:32:33.1001629Z 2025-05-07T20:32:33.1001824Z if contiguous: 2025-05-07T20:32:33.1002055Z x0 = x0.contiguous() 2025-05-07T20:32:33.1002318Z x1 = x1.contiguous() 2025-05-07T20:32:33.1002565Z 2025-05-07T20:32:33.1002756Z if scale_ub is not None: 2025-05-07T20:32:33.1003037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1003382Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1003847Z ) 2025-05-07T20:32:33.1004045Z else: 2025-05-07T20:32:33.1004262Z scale_ub_tensor = None 2025-05-07T20:32:33.1004514Z 2025-05-07T20:32:33.1004748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1005063Z op = silu_mul_quant 2025-05-07T20:32:33.1005410Z if compiled: 2025-05-07T20:32:33.1005667Z op = torch.compile(op) 2025-05-07T20:32:33.1006050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1006324Z 2025-05-07T20:32:33.1006521Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.1006693Z 2025-05-07T20:32:33.1006796Z moe/activation_test.py:117: 2025-05-07T20:32:33.1007094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1007426Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.1007713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1008415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.1009113Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.1009659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.1010358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.1011128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.1011667Z kernel = self.compile( 2025-05-07T20:32:33.1012221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.1012886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.1013287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1013518Z 2025-05-07T20:32:33.1013728Z self = 2025-05-07T20:32:33.1014867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.1016261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fb3ab1ec2c0>} 2025-05-07T20:32:33.1017618Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.1018651Z context = 2025-05-07T20:32:33.1018950Z 2025-05-07T20:32:33.1019119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.1019649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.1020130Z module_map=module_map) 2025-05-07T20:32:33.1020493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.1020854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.1021125Z E ^ 2025-05-07T20:32:33.1021645Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1022116Z 2025-05-07T20:32:33.1022539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1023067Z 2025-05-07T20:32:33.1023175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1023593Z self=, 2025-05-07T20:32:33.1024001Z T=2048, 2025-05-07T20:32:33.1024197Z D=5120, 2025-05-07T20:32:33.1024398Z scale_ub=1200.0, 2025-05-07T20:32:33.1024624Z contiguous=True, 2025-05-07T20:32:33.1024855Z compiled=True, 2025-05-07T20:32:33.1025071Z ) 2025-05-07T20:32:33.1025392Z self = 2025-05-07T20:32:33.1025935Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:33.1026219Z 2025-05-07T20:32:33.1026340Z @given( 2025-05-07T20:32:33.1026580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.1026892Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.1027202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.1027538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.1027864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.1028158Z ) 2025-05-07T20:32:33.1028513Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.1028955Z def test_silu_mul_quant( 2025-05-07T20:32:33.1029202Z self, 2025-05-07T20:32:33.1029402Z T: int, 2025-05-07T20:32:33.1029595Z D: int, 2025-05-07T20:32:33.1029818Z scale_ub: Optional[float], 2025-05-07T20:32:33.1030093Z contiguous: bool, 2025-05-07T20:32:33.1030340Z compiled: bool, 2025-05-07T20:32:33.1030561Z ) -> None: 2025-05-07T20:32:33.1030785Z torch.manual_seed(2025) 2025-05-07T20:32:33.1031089Z 2025-05-07T20:32:33.1031362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.1031708Z 2025-05-07T20:32:33.1031906Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1032194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1032508Z x = x_sign * x_clamp 2025-05-07T20:32:33.1032756Z x0 = x[:, :D] 2025-05-07T20:32:33.1032970Z x1 = x[:, D:] 2025-05-07T20:32:33.1033187Z 2025-05-07T20:32:33.1033379Z if contiguous: 2025-05-07T20:32:33.1033610Z x0 = x0.contiguous() 2025-05-07T20:32:33.1033875Z x1 = x1.contiguous() 2025-05-07T20:32:33.1034126Z 2025-05-07T20:32:33.1034316Z if scale_ub is not None: 2025-05-07T20:32:33.1034596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1034938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1035260Z ) 2025-05-07T20:32:33.1035460Z else: 2025-05-07T20:32:33.1035683Z scale_ub_tensor = None 2025-05-07T20:32:33.1035944Z 2025-05-07T20:32:33.1036173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1036497Z op = silu_mul_quant 2025-05-07T20:32:33.1036754Z if compiled: 
2025-05-07T20:32:33.1036999Z op = torch.compile(op) 2025-05-07T20:32:33.1037305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1037591Z 2025-05-07T20:32:33.1037782Z y_fp8, y_scale = fn() 2025-05-07T20:32:33.1038073Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:33.1038697Z 2025-05-07T20:32:33.1038944Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1039289Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:33.1039589Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:33.1039909Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:33.1040345Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:33.1040673Z 2025-05-07T20:32:33.1040880Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:33.1041076Z 2025-05-07T20:32:33.1041176Z moe/activation_test.py:126: 2025-05-07T20:32:33.1041475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1041814Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:33.1042138Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:33.1042934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:33.1043814Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:33.1044389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.1045180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.1045938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:33.1046679Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.1047445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:33.1048195Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.1048938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:33.1049590Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:33.1050199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:33.1050734Z fn() 2025-05-07T20:32:33.1051259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:33.1051918Z self.fn.run( 2025-05-07T20:32:33.1052389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.1059714Z kernel = self.compile( 2025-05-07T20:32:33.1060318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.1060997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.1061399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1061640Z 2025-05-07T20:32:33.1061850Z self = 2025-05-07T20:32:33.1062949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:33.1064334Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3aa0eb880>} 2025-05-07T20:32:33.1065678Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.1066716Z context = 2025-05-07T20:32:33.1067017Z 2025-05-07T20:32:33.1067189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.1067724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.1068188Z module_map=module_map) 2025-05-07T20:32:33.1068565Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.1069002Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:33.1069273Z E ^ 2025-05-07T20:32:33.1069745Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1070206Z 2025-05-07T20:32:33.1070628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1071142Z 2025-05-07T20:32:33.1071252Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1071662Z self=, 2025-05-07T20:32:33.1072070Z T=16384, 2025-05-07T20:32:33.1072270Z D=7168, 2025-05-07T20:32:33.1072464Z scale_ub=1200.0, 2025-05-07T20:32:33.1072693Z contiguous=False, 2025-05-07T20:32:33.1072924Z compiled=False, 2025-05-07T20:32:33.1073132Z ) 2025-05-07T20:32:33.9301293Z self = 2025-05-07T20:32:33.9302195Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.9302594Z 2025-05-07T20:32:33.9302697Z @given( 2025-05-07T20:32:33.9303006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.9303420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.9303788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.9304133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.9304471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.9304770Z ) 2025-05-07T20:32:33.9305124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.9305580Z def test_silu_mul_quant( 2025-05-07T20:32:33.9305833Z self, 2025-05-07T20:32:33.9306032Z T: int, 2025-05-07T20:32:33.9306242Z D: int, 2025-05-07T20:32:33.9306472Z scale_ub: Optional[float], 2025-05-07T20:32:33.9306751Z contiguous: bool, 2025-05-07T20:32:33.9307008Z compiled: bool, 2025-05-07T20:32:33.9307356Z ) -> None: 2025-05-07T20:32:33.9307574Z torch.manual_seed(2025) 2025-05-07T20:32:33.9307827Z 2025-05-07T20:32:33.9308119Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.9308462Z 2025-05-07T20:32:33.9308667Z x_sign = torch.sign(x) 2025-05-07T20:32:33.9308967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.9309284Z x = x_sign * x_clamp 2025-05-07T20:32:33.9309525Z x0 = x[:, :D] 2025-05-07T20:32:33.9309753Z x1 = x[:, D:] 2025-05-07T20:32:33.9309998Z 2025-05-07T20:32:33.9310193Z if contiguous: 2025-05-07T20:32:33.9310429Z x0 = x0.contiguous() 2025-05-07T20:32:33.9310697Z x1 = x1.contiguous() 2025-05-07T20:32:33.9310950Z 2025-05-07T20:32:33.9311155Z if scale_ub is not None: 2025-05-07T20:32:33.9311442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.9311788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.9312113Z ) 2025-05-07T20:32:33.9312306Z else: 2025-05-07T20:32:33.9312528Z scale_ub_tensor = None 2025-05-07T20:32:33.9312796Z 2025-05-07T20:32:33.9313027Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:33.9313351Z op = silu_mul_quant 2025-05-07T20:32:33.9313612Z if compiled: 2025-05-07T20:32:33.9313859Z op = torch.compile(op) 2025-05-07T20:32:33.9314163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9314448Z 2025-05-07T20:32:33.9314648Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.9314828Z 2025-05-07T20:32:33.9314933Z moe/activation_test.py:117: 2025-05-07T20:32:33.9315230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9315563Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.9315858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9316661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.9317370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.9317917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.9318616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.9319298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.9319837Z kernel = self.compile( 2025-05-07T20:32:33.9320393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.9321059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.9321510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9321781Z 2025-05-07T20:32:33.9321994Z self = 2025-05-07T20:32:33.9323079Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.9324605Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a9e23380>} 2025-05-07T20:32:33.9326007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.9327039Z context = 2025-05-07T20:32:33.9327330Z 2025-05-07T20:32:33.9327504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.9328082Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.9328556Z module_map=module_map) 2025-05-07T20:32:33.9328920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.9329282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.9329549Z E ^ 2025-05-07T20:32:33.9330010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.9330472Z 
2025-05-07T20:32:33.9330890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:33.9331413Z 
2025-05-07T20:32:33.9331519Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:33.9334174Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:33.9334440Z 
2025-05-07T20:32:33.9334518Z     @given(
2025-05-07T20:32:33.9334750Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:33.9335065Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:33.9335372Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:33.9335704Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:33.9336032Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:33.9336325Z     )
2025-05-07T20:32:33.9336682Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:33.9337181Z     def test_silu_mul_quant(
2025-05-07T20:32:33.9337421Z         self,
2025-05-07T20:32:33.9337622Z         T: int,
2025-05-07T20:32:33.9337823Z         D: int,
2025-05-07T20:32:33.9338038Z         scale_ub: Optional[float],
2025-05-07T20:32:33.9338320Z         contiguous: bool,
2025-05-07T20:32:33.9338854Z         compiled: bool,
2025-05-07T20:32:33.9339076Z     ) -> None:
2025-05-07T20:32:33.9339297Z         torch.manual_seed(2025)
2025-05-07T20:32:33.9339544Z 
2025-05-07T20:32:33.9339816Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:33.9340165Z 
2025-05-07T20:32:33.9340363Z         x_sign = torch.sign(x)
2025-05-07T20:32:33.9340647Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:33.9340958Z         x = x_sign * x_clamp
2025-05-07T20:32:33.9341198Z         x0 = x[:, :D]
2025-05-07T20:32:33.9341410Z         x1 = x[:, D:]
2025-05-07T20:32:33.9341693Z 
2025-05-07T20:32:33.9341883Z         if contiguous:
2025-05-07T20:32:33.9342176Z             x0 = x0.contiguous()
2025-05-07T20:32:33.9342440Z             x1 = x1.contiguous()
2025-05-07T20:32:33.9342679Z 
2025-05-07T20:32:33.9342871Z         if scale_ub is not None:
2025-05-07T20:32:33.9343146Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:33.9343488Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:33.9343806Z             )
2025-05-07T20:32:33.9343997Z         else:
2025-05-07T20:32:33.9344209Z             scale_ub_tensor = None
2025-05-07T20:32:33.9344465Z 
2025-05-07T20:32:33.9344696Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:33.9345015Z             op = silu_mul_quant
2025-05-07T20:32:33.9345270Z             if compiled:
2025-05-07T20:32:33.9345510Z                 op = torch.compile(op)
2025-05-07T20:32:33.9345814Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.9346092Z 
2025-05-07T20:32:33.9346285Z         y_fp8, y_scale = fn()
2025-05-07T20:32:33.9346583Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:33.9346953Z 
2025-05-07T20:32:33.9347188Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:33.9347531Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:33.9347830Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:33.9348153Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:33.9348513Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:33.9348827Z 
2025-05-07T20:32:33.9349035Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:33.9349229Z 
2025-05-07T20:32:33.9349330Z moe/activation_test.py:126: 
2025-05-07T20:32:33.9349631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:33.9349970Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:33.9350299Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:33.9351101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:33.9351867Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:33.9352424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:33.9353107Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:33.9353805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:33.9354547Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:33.9355362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:33.9356113Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:33.9356951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:33.9357606Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:33.9358214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:33.9358731Z     fn()
2025-05-07T20:32:33.9359244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:33.9359833Z     self.fn.run(
2025-05-07T20:32:33.9360297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:33.9360832Z     kernel = self.compile(
2025-05-07T20:32:33.9361380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:33.9362089Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:33.9362542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:33.9368506Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:33.9369028Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:33.9369502Z                            module_map=module_map)
2025-05-07T20:32:33.9369870Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.9370233Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:33.9370500Z E       ^
2025-05-07T20:32:33.9370971Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.9371423Z 
2025-05-07T20:32:33.9371858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
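Note: the ValueError above comes from Triton's NVIDIA backend. fp8e4nv (float8_e4m3fn) codegen is only available on newer GPU architectures, and the dtypes the error lists (fp8e4b15, fp8e5) are what this runner's GPU does support. A minimal guard along these lines, assuming the requirement is compute capability 8.9 or newer (the helper and decorator names below are illustrative, not FBGEMM's API), would skip rather than fail these tests on such machines:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs an SM 8.9+ GPU (Ada/Hopper);
        # an A10G (SM 8.6) reports capability (8, 6) and would be skipped.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests that compile fp8e4nv Triton kernels.
    skip_if_no_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
    )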
2025-05-07T20:32:33.9372490Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.8721440Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:34.8721717Z moe/activation_test.py:117: 
2025-05-07T20:32:34.8735526Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.8735888Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.8736156Z E       ^
2025-05-07T20:32:34.8736669Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.8737600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.8738230Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.8753182Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:34.8753455Z moe/activation_test.py:117: 
2025-05-07T20:32:34.8767246Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.8767605Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.8767869Z E       ^
2025-05-07T20:32:34.8768329Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.8769213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.8769843Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:34.9242512Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:34.9242823Z moe/activation_test.py:126: 
2025-05-07T20:32:34.9263493Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.9263853Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:34.9264114Z E       ^
2025-05-07T20:32:34.9264582Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.9265461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
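The reference path that keeps failing here (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row) is itself a Triton kernel, so it trips over the same fp8e4nv limitation as the kernel under test. A rough pure-PyTorch sketch of rowwise fp8 quantization (an assumption about what such a routine computes, not FBGEMM's implementation) shows that a reference of this shape does not strictly need Triton:

    import torch

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # Rowwise absmax scaling into float8_e4m3fn; illustrative only.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)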
2025-05-07T20:32:34.9266087Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:35.2452496Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.2452768Z moe/activation_test.py:117: 
2025-05-07T20:32:35.2466481Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.2466835Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.2467102Z E       ^
2025-05-07T20:32:35.2467573Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.2468448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.2469127Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.2483673Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.2483943Z moe/activation_test.py:117: 
2025-05-07T20:32:35.2497551Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.2497908Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.2498173Z E       ^
2025-05-07T20:32:35.2498713Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.2499590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.2500216Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:35.7031616Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:35.7031924Z moe/activation_test.py:126: 
2025-05-07T20:32:35.7053038Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7053479Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.7053754Z E       ^
2025-05-07T20:32:35.7054228Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7055112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
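Several of the failing examples use contiguous=False, which also matters for the recompile warning at the end of this log: slicing x[:, :D] out of a [T, 2*D] buffer yields a view whose row stride is still 2*D, not D. A small self-contained illustration (plain PyTorch, independent of the test above):

    import torch

    x = torch.randn(4, 2 * 8)        # [T, 2*D] buffer with D = 8
    x0 = x[:, :8]                    # view into the left half
    print(x0.is_contiguous())        # False: row stride is still 16 (= 2*D)
    print(x0.stride())               # (16, 1)
    print(x0.contiguous().stride())  # (8, 1) after copying into a dense tensor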
2025-05-07T20:32:35.7055795Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.1473775Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.1474073Z moe/activation_test.py:126: 
2025-05-07T20:32:36.1494607Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1494970Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.1495233Z E       ^
2025-05-07T20:32:36.1495694Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1496568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.1497185Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.8371411Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.8371717Z moe/activation_test.py:126: 
2025-05-07T20:32:36.8392071Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.8392443Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.8392714Z E       ^
2025-05-07T20:32:36.8393182Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.8394061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.8394683Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:37.3604159Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.3604458Z moe/activation_test.py:126: 
2025-05-07T20:32:37.3624871Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.3625236Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.3625503Z E       ^
2025-05-07T20:32:37.3625977Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.3626860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
= None 2025-05-07T20:32:37.3599284Z 2025-05-07T20:32:37.3599524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.3599847Z op = silu_mul_quant 2025-05-07T20:32:37.3600108Z if compiled: 2025-05-07T20:32:37.3600350Z op = torch.compile(op) 2025-05-07T20:32:37.3600651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.3600942Z 2025-05-07T20:32:37.3601137Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.3601439Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.3601738Z 2025-05-07T20:32:37.3602105Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.3602448Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.3602746Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.3603064Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.3603417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.3603954Z 2025-05-07T20:32:37.3604159Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.3604354Z 2025-05-07T20:32:37.3604458Z moe/activation_test.py:126: 2025-05-07T20:32:37.3604759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3605099Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.3605423Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.3606327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.3607187Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.3607738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.3608420Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.3609122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.3609863Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.3610626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.3611379Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.3612121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.3612816Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.3613422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.3613941Z fn() 2025-05-07T20:32:37.3614460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.3615052Z self.fn.run( 2025-05-07T20:32:37.3615522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.3616057Z kernel = self.compile( 2025-05-07T20:32:37.3616606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.3617260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.3617661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3617904Z 2025-05-07T20:32:37.3618117Z self = 2025-05-07T20:32:37.3619205Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.3620601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a887e700>} 2025-05-07T20:32:37.3621951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.3622987Z context = 2025-05-07T20:32:37.3623286Z 2025-05-07T20:32:37.3623498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.3624033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.3624503Z module_map=module_map) 2025-05-07T20:32:37.3624871Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.3625236Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.3625503Z E ^ 2025-05-07T20:32:37.3625977Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.3626438Z 2025-05-07T20:32:37.3626860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.3627376Z 2025-05-07T20:32:37.3627491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.3627949Z self=, 2025-05-07T20:32:37.3628394Z T=16384, 2025-05-07T20:32:37.3628603Z D=5120, 2025-05-07T20:32:37.3628798Z scale_ub=None, 2025-05-07T20:32:37.3629016Z contiguous=True, 2025-05-07T20:32:37.3629249Z compiled=True, 2025-05-07T20:32:37.3629454Z ) 2025-05-07T20:32:37.3893399Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:37.3894940Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:37.3896339Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:37.3897343Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:37.3898633Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
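Every drawn example dies in the same place: Triton's front end rejects the fp8e4nv element type while lowering _kernel_quantize_fp8_row (and, below, _fbgemm_silu_mul_quant), reporting that this architecture only implements fp8e4b15 and fp8e5. A minimal sketch of an up-front capability probe, assuming Triton's usual rule that fp8e4nv (float8_e4m3fn) needs compute capability 8.9 or newer; supports_fp8e4nv is a hypothetical helper, not an FBGEMM or Triton API:

import torch

def supports_fp8e4nv() -> bool:
    """Best-effort probe for fp8e4nv support (sketch, not authoritative)."""
    if not torch.cuda.is_available():
        return False
    # fp8e4nv Triton kernels are generally limited to SM 8.9+ (Ada/Hopper);
    # older GPUs expose only the fp8e4b15/fp8e5 encodings, which is exactly
    # what the ValueError in this log reports.
    return torch.cuda.get_device_capability() >= (8, 9)

Used as a guard, a probe like this would turn the unconditional failures below into clean skips on pre-8.9 GPUs.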
self = 
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8670ae0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
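The sweep keeps drawing fresh examples and every one fails identically, so nothing new is learned after the first report. A sketch of how the suite could short-circuit the whole sweep instead, assuming the standard unittest pattern; the class name ActivationTests is a guess for the class in moe/activation_test.py, not taken from the log:

import unittest

import torch


class ActivationTests(unittest.TestCase):  # hypothetical name
    def setUp(self) -> None:
        # Skip every fp8 test up front on GPUs that cannot compile fp8e4nv,
        # rather than failing once per hypothesis example.
        if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9):
            self.skipTest("fp8e4nv needs SM 8.9+; this GPU only has fp8e4b15/fp8e5")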
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
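Independent of the fp8 failures, the recompile warning earlier in the log shows torch.compile re-specializing silu_mul_quant once per distinct shape/stride combination until config.recompile_limit (8) is exhausted: the hypothesis sweep varies T, and flipping contiguous changes x0's stride at dim 0 from 10240 to 5120. A sketch of two standard dynamo-side mitigations; the shape values are taken from the log, and neither line appears in the test as written:

import torch

# Option 1: mark the token dimension dynamic so a new T does not force a
# fresh specialization of the compiled op.
x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x0, 0)

# Option 2: give shape-sweep tests more headroom than the default of 8
# (the limit named in the W0507 warning above).
torch._dynamo.config.recompile_limit = 64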
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.0243500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.0244311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.0244989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.0245525Z kernel = self.compile( 2025-05-07T20:32:38.0246663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.0247345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.0247748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.0247977Z 2025-05-07T20:32:38.0248186Z self = 2025-05-07T20:32:38.0249260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.0250624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785c720>} 2025-05-07T20:32:38.0252039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.0253128Z context = 2025-05-07T20:32:38.0253416Z 2025-05-07T20:32:38.0253586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.0254115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.0254587Z module_map=module_map) 2025-05-07T20:32:38.0254951Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.0255316Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.0255579Z E ^ 2025-05-07T20:32:38.0256054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.0256506Z 2025-05-07T20:32:38.0256932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.0257526Z 2025-05-07T20:32:38.0257631Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.0258051Z self=, 2025-05-07T20:32:38.0258449Z T=128, 2025-05-07T20:32:38.0258642Z D=5120, 2025-05-07T20:32:38.0258841Z scale_ub=1200.0, 2025-05-07T20:32:38.0259069Z contiguous=True, 2025-05-07T20:32:38.0259294Z compiled=False, 2025-05-07T20:32:38.0259502Z ) 2025-05-07T20:32:38.3387855Z self = 2025-05-07T20:32:38.3388613Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:38.3388986Z 2025-05-07T20:32:38.3389105Z @given( 2025-05-07T20:32:38.3389344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.3389685Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.3390002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.3390342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.3390676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.3390964Z ) 2025-05-07T20:32:38.3391327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.3391774Z def test_silu_mul_quant( 2025-05-07T20:32:38.3392020Z self, 2025-05-07T20:32:38.3392216Z T: int, 2025-05-07T20:32:38.3392421Z D: int, 2025-05-07T20:32:38.3392646Z scale_ub: Optional[float], 2025-05-07T20:32:38.3392915Z contiguous: bool, 2025-05-07T20:32:38.3393163Z compiled: bool, 2025-05-07T20:32:38.3393397Z ) -> None: 2025-05-07T20:32:38.3393613Z torch.manual_seed(2025) 2025-05-07T20:32:38.3393863Z 2025-05-07T20:32:38.3394146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.3394504Z 2025-05-07T20:32:38.3394697Z x_sign = torch.sign(x) 2025-05-07T20:32:38.3395321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.3395641Z x = x_sign * x_clamp 2025-05-07T20:32:38.3395879Z x0 = x[:, :D] 2025-05-07T20:32:38.3396100Z x1 = x[:, D:] 2025-05-07T20:32:38.3396310Z 2025-05-07T20:32:38.3396497Z if contiguous: 2025-05-07T20:32:38.3396735Z x0 = x0.contiguous() 2025-05-07T20:32:38.3396995Z x1 = x1.contiguous() 2025-05-07T20:32:38.3397237Z 2025-05-07T20:32:38.3397433Z if scale_ub is not None: 2025-05-07T20:32:38.3397709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.3398039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.3398348Z ) 2025-05-07T20:32:38.3398548Z else: 2025-05-07T20:32:38.3398756Z scale_ub_tensor = None 2025-05-07T20:32:38.3399014Z 2025-05-07T20:32:38.3399347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.3399664Z op = silu_mul_quant 2025-05-07T20:32:38.3400009Z if compiled: 2025-05-07T20:32:38.3400260Z op = torch.compile(op) 2025-05-07T20:32:38.3400562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3400835Z 2025-05-07T20:32:38.3401031Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.3401195Z 2025-05-07T20:32:38.3401300Z moe/activation_test.py:117: 2025-05-07T20:32:38.3401594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3401928Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.3402215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3402911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.3403764Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.3404314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.3405011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.3405771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.3406365Z kernel = self.compile( 2025-05-07T20:32:38.3406918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.3407587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.3407983Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3408220Z 2025-05-07T20:32:38.3408428Z self = 2025-05-07T20:32:38.3409514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.3410895Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785d8a0>} 2025-05-07T20:32:38.3412228Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.3413259Z context = 2025-05-07T20:32:38.3413555Z 2025-05-07T20:32:38.3413723Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.3414250Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.3414714Z module_map=module_map) 2025-05-07T20:32:38.3415084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.3415529Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.3415794Z E ^ 2025-05-07T20:32:38.3416263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.3416724Z 2025-05-07T20:32:38.3417144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.3417659Z 2025-05-07T20:32:38.3417772Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.3418180Z self=, 2025-05-07T20:32:38.3418582Z T=1, 2025-05-07T20:32:38.3418772Z D=7168, 2025-05-07T20:32:38.3418963Z scale_ub=1200.0, 2025-05-07T20:32:38.3419186Z contiguous=True, 2025-05-07T20:32:38.3419411Z compiled=True, 2025-05-07T20:32:38.3419614Z ) 2025-05-07T20:32:38.3419989Z self = 2025-05-07T20:32:38.3420487Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:38.3420797Z 2025-05-07T20:32:38.3420884Z @given( 2025-05-07T20:32:38.3421108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.3421428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.3421740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.3422065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.3422401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.3422691Z ) 2025-05-07T20:32:38.3423034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.3423479Z def test_silu_mul_quant( 2025-05-07T20:32:38.3423725Z self, 2025-05-07T20:32:38.3423918Z T: int, 2025-05-07T20:32:38.3424111Z D: int, 2025-05-07T20:32:38.3424329Z scale_ub: Optional[float], 2025-05-07T20:32:38.3424607Z contiguous: bool, 2025-05-07T20:32:38.3424843Z compiled: bool, 2025-05-07T20:32:38.3425121Z ) -> None: 2025-05-07T20:32:38.3425340Z torch.manual_seed(2025) 2025-05-07T20:32:38.3425577Z 2025-05-07T20:32:38.3425852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.3426200Z 2025-05-07T20:32:38.3426391Z x_sign = torch.sign(x) 2025-05-07T20:32:38.3426688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.3427000Z x = x_sign * x_clamp 2025-05-07T20:32:38.3427236Z x0 = x[:, :D] 2025-05-07T20:32:38.3427456Z x1 = x[:, D:] 2025-05-07T20:32:38.3427668Z 2025-05-07T20:32:38.3427854Z if contiguous: 2025-05-07T20:32:38.3428089Z x0 = x0.contiguous() 2025-05-07T20:32:38.3428355Z x1 = x1.contiguous() 2025-05-07T20:32:38.3428595Z 2025-05-07T20:32:38.3428793Z if scale_ub is not None: 2025-05-07T20:32:38.3429075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.3429414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.3429723Z ) 2025-05-07T20:32:38.3429916Z else: 2025-05-07T20:32:38.3430129Z scale_ub_tensor = None 2025-05-07T20:32:38.3430380Z 2025-05-07T20:32:38.3430615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.3430929Z op = silu_mul_quant 2025-05-07T20:32:38.3431173Z if compiled: 2025-05-07T20:32:38.3431420Z op = torch.compile(op) 2025-05-07T20:32:38.3431716Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3431986Z 2025-05-07T20:32:38.3432180Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.3432341Z 2025-05-07T20:32:38.3432448Z moe/activation_test.py:117: 2025-05-07T20:32:38.3432736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3433067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.3433351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3433962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.3434528Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.3435191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.3435891Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.3436428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.3437184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.3437854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.3438661Z kernel = self.compile( 2025-05-07T20:32:38.3439282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.3439962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.3440422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3440650Z 2025-05-07T20:32:38.3440864Z self = 2025-05-07T20:32:38.3441935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.3443298Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785ee80>} 2025-05-07T20:32:38.3444750Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.3445780Z context = 2025-05-07T20:32:38.3446137Z 2025-05-07T20:32:38.3446312Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.3446829Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.3447299Z module_map=module_map) 2025-05-07T20:32:38.3447665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.3448015Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.3448276Z E ^ 2025-05-07T20:32:38.3448741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.3449617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.3450242Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> fails with the identical CompilationError in _fbgemm_silu_mul_quant
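Every failure in this run has the same root cause: the FBGEMM GenAI Triton kernels cast their outputs to fp8e4nv (the NVIDIA-native FP8 E4M3 encoding, torch.float8_e4m3fn on the PyTorch side), and Triton only lowers that dtype on GPUs with compute capability 8.9 or newer (Ada/Hopper). The error message and the g5 instance family point to an A10G-class GPU, which reports sm_86, so only the fp8e4b15 and fp8e5 encodings are available and compilation aborts in make_ir before any kernel launches. Below is a minimal sketch of a capability guard a test like this could use to skip cleanly on unsupported hardware; supports_fp8e4nv is a hypothetical helper, not something that exists in the FBGEMM test file:

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if this GPU can compile Triton kernels that use fp8e4nv.

    Triton lowers fp8e4nv (torch.float8_e4m3fn) to native PTX conversion
    instructions that only exist on sm_89+ (Ada/Hopper); an A10G reports
    (8, 6) and triggers the ValueError seen in the traces above.
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing Hypothesis test:
#
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
# def test_silu_mul_quant(self, ...) -> None:
#     ...
```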
2025-05-07T20:32:38.4514678Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.4515087Z     self=<...>,
2025-05-07T20:32:38.4515489Z     T=1,
2025-05-07T20:32:38.4515675Z     D=7168,
2025-05-07T20:32:38.4515866Z     scale_ub=None,
2025-05-07T20:32:38.4516082Z     contiguous=False,
2025-05-07T20:32:38.4516309Z     compiled=True,
2025-05-07T20:32:38.4516511Z )
2025-05-07T20:32:38.5190308Z self = <...>
2025-05-07T20:32:38.5191374Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
(test body as in the first trace above; here fn() returns and the failure moves to the eager reference path)
2025-05-07T20:32:38.5203166Z         y_fp8, y_scale = fn()
2025-05-07T20:32:38.5203445Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:38.5204151Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:38.5204569Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:38.5204876Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:38.5205278Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:38.5205641Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:38.5206152Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:38.5206454Z moe/activation_test.py:126:
2025-05-07T20:32:38.5206745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:38.5207084Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:38.5207414Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:38.5208202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:38.5208962Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:38.5209518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:38.5210261Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:38.5210950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:38.5211682Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:38.5212445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:38.5213201Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:38.5213931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:38.5214580Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:38.5215199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:38.5215734Z     fn()
2025-05-07T20:32:38.5216246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:38.5216833Z     self.fn.run(
2025-05-07T20:32:38.5217307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:38.5217838Z     kernel = self.compile(
2025-05-07T20:32:38.5218386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:38.5219048Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:38.5219450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:38.5219887Z self = <...>
2025-05-07T20:32:38.5221015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:38.5222403Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb287af5580>}
2025-05-07T20:32:38.5223752Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:38.5224777Z context = <...>
2025-05-07T20:32:38.5225240Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:38.5225810Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:38.5226323Z                            module_map=module_map)
2025-05-07T20:32:38.5226690Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.5227054Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:38.5227325Z E   ^
2025-05-07T20:32:38.5227788Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.5228670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
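So both the fused _fbgemm_silu_mul_quant kernel and the eager reference path through triton_quantize_fp8_row (_kernel_quantize_fp8_row) share the dependency on fp8e4nv. The failure is reproducible without FBGEMM at all; the following is a minimal sketch, assuming only that Triton and a CUDA device are present, of a standalone kernel that should hit the same ValueError on a pre-sm_89 GPU. The kernel name _cast_fp8e4nv is hypothetical:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # The cast below is what make_ir rejects on sm_86: fp8e4nv only
    # lowers on compute capability 8.9+, and 'fp8e4b15' / 'fp8e5' are
    # the only fp8 encodings Triton supports on older parts.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (sm_86) this should raise triton.compiler.errors.CompilationError
# wrapping ValueError("type fp8e4nv not supported in this architecture. ...").
_cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
```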
Hypothesis continues drawing examples; each of the following fails with the identical CompilationError in _fbgemm_silu_mul_quant:
2025-05-07T20:32:38.5229301Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:38.6478202Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:38.6509679Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:38.9012146Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:38.9944410Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:38.9976900Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:39.0008192Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:39.1421641Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1460629Z 2025-05-07T20:32:39.1461048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.1461576Z 2025-05-07T20:32:39.2611458Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2612746Z self=, 2025-05-07T20:32:39.2613844Z T=4096, 2025-05-07T20:32:39.2614343Z D=5120, 2025-05-07T20:32:39.2614819Z scale_ub=1200.0, 2025-05-07T20:32:39.2615267Z contiguous=False, 2025-05-07T20:32:39.2615703Z compiled=False, 2025-05-07T20:32:39.2616107Z ) 2025-05-07T20:32:39.2616741Z self = 2025-05-07T20:32:39.2617313Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.2617607Z 2025-05-07T20:32:39.2617705Z @given( 2025-05-07T20:32:39.2617970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2618599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2618943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2619282Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2619614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2619907Z ) 2025-05-07T20:32:39.2620251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2620697Z def test_silu_mul_quant( 2025-05-07T20:32:39.2620943Z self, 2025-05-07T20:32:39.2621130Z T: int, 2025-05-07T20:32:39.2621329Z D: int, 2025-05-07T20:32:39.2621552Z scale_ub: Optional[float], 2025-05-07T20:32:39.2621821Z contiguous: bool, 2025-05-07T20:32:39.2622062Z compiled: bool, 2025-05-07T20:32:39.2622294Z ) -> None: 2025-05-07T20:32:39.2622506Z torch.manual_seed(2025) 2025-05-07T20:32:39.2622746Z 2025-05-07T20:32:39.2623030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2623378Z 2025-05-07T20:32:39.2623574Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2623871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2624184Z x = x_sign * x_clamp 2025-05-07T20:32:39.2624418Z x0 = x[:, :D] 2025-05-07T20:32:39.2624635Z x1 = x[:, D:] 2025-05-07T20:32:39.2624840Z 2025-05-07T20:32:39.2625020Z if contiguous: 2025-05-07T20:32:39.2625259Z x0 = x0.contiguous() 2025-05-07T20:32:39.2625518Z x1 = x1.contiguous() 2025-05-07T20:32:39.2625757Z 2025-05-07T20:32:39.2625952Z if scale_ub is not None: 2025-05-07T20:32:39.2626226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2626560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2626871Z ) 2025-05-07T20:32:39.2627075Z else: 2025-05-07T20:32:39.2627378Z scale_ub_tensor = None 2025-05-07T20:32:39.2627638Z 2025-05-07T20:32:39.2627874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2628183Z op = silu_mul_quant 2025-05-07T20:32:39.2628439Z if compiled: 2025-05-07T20:32:39.2628694Z op = torch.compile(op) 2025-05-07T20:32:39.2629001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2629273Z 2025-05-07T20:32:39.2629472Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2629636Z 2025-05-07T20:32:39.2629748Z moe/activation_test.py:117: 2025-05-07T20:32:39.2630037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2630375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2630666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2631449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.2632243Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2632795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2633486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2634158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2634700Z kernel = self.compile( 2025-05-07T20:32:39.2635252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2635917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2636312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2636576Z 2025-05-07T20:32:39.2636811Z self = 2025-05-07T20:32:39.2637892Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2639642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8bab420>} 2025-05-07T20:32:39.2640990Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2642026Z context = 2025-05-07T20:32:39.2642323Z 2025-05-07T20:32:39.2642490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2643017Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2643489Z module_map=module_map) 2025-05-07T20:32:39.2644016Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2644370Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2644625Z E ^ 2025-05-07T20:32:39.2645092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2645552Z 2025-05-07T20:32:39.2645971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2646488Z 2025-05-07T20:32:39.2646601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2647011Z self=, 2025-05-07T20:32:39.2647420Z T=4096, 2025-05-07T20:32:39.2647611Z D=5120, 2025-05-07T20:32:39.2647798Z scale_ub=1200.0, 2025-05-07T20:32:39.2648027Z contiguous=False, 2025-05-07T20:32:39.2648257Z compiled=True, 2025-05-07T20:32:39.2648540Z ) 2025-05-07T20:32:39.2648866Z self = 2025-05-07T20:32:39.2649363Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.2649636Z 2025-05-07T20:32:39.2649722Z @given( 2025-05-07T20:32:39.2649948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2650269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2650579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2650905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2651238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2651529Z ) 2025-05-07T20:32:39.2651875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2652321Z def test_silu_mul_quant( 2025-05-07T20:32:39.2652630Z self, 2025-05-07T20:32:39.2652834Z T: int, 2025-05-07T20:32:39.2653086Z D: int, 2025-05-07T20:32:39.2653317Z scale_ub: Optional[float], 2025-05-07T20:32:39.2653592Z contiguous: bool, 2025-05-07T20:32:39.2653832Z compiled: bool, 2025-05-07T20:32:39.2654059Z ) -> None: 2025-05-07T20:32:39.2654283Z torch.manual_seed(2025) 2025-05-07T20:32:39.2654526Z 2025-05-07T20:32:39.2654806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2655153Z 2025-05-07T20:32:39.2655345Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2655640Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2655949Z x = x_sign * x_clamp 2025-05-07T20:32:39.2656184Z x0 = x[:, :D] 2025-05-07T20:32:39.2656405Z x1 = x[:, D:] 2025-05-07T20:32:39.2656621Z 2025-05-07T20:32:39.2656809Z if contiguous: 2025-05-07T20:32:39.2657043Z x0 = x0.contiguous() 2025-05-07T20:32:39.2657313Z x1 = x1.contiguous() 2025-05-07T20:32:39.2657550Z 2025-05-07T20:32:39.2657818Z if scale_ub is not None: 2025-05-07T20:32:39.2658095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2658430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2658734Z ) 2025-05-07T20:32:39.2658929Z else: 2025-05-07T20:32:39.2659141Z scale_ub_tensor = None 2025-05-07T20:32:39.2659390Z 2025-05-07T20:32:39.2659644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2659957Z op = silu_mul_quant 2025-05-07T20:32:39.2660214Z if compiled: 2025-05-07T20:32:39.2660467Z op = torch.compile(op) 2025-05-07T20:32:39.2660759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2661039Z 2025-05-07T20:32:39.2661234Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2661401Z 2025-05-07T20:32:39.2661510Z moe/activation_test.py:117: 2025-05-07T20:32:39.2661805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2662145Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2662436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2662991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.2663559Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.2664228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2664917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2665454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2666145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2666875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2667407Z kernel = self.compile( 2025-05-07T20:32:39.2668007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2668671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2669070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2669299Z 2025-05-07T20:32:39.2669507Z self = 2025-05-07T20:32:39.2670590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2671997Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fc860>} 2025-05-07T20:32:39.2673352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2674428Z context = 2025-05-07T20:32:39.2674717Z 2025-05-07T20:32:39.2674885Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2675409Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2675883Z module_map=module_map) 2025-05-07T20:32:39.2676245Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2676612Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2676878Z E ^ 2025-05-07T20:32:39.2677355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2677807Z 2025-05-07T20:32:39.2678233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2678806Z 2025-05-07T20:32:39.3565469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3566772Z self=, 2025-05-07T20:32:39.3567428Z T=2048, 2025-05-07T20:32:39.3567644Z D=7168, 2025-05-07T20:32:39.3567834Z scale_ub=1200.0, 2025-05-07T20:32:39.3568067Z contiguous=False, 2025-05-07T20:32:39.3568301Z compiled=False, 2025-05-07T20:32:39.3568506Z ) 2025-05-07T20:32:39.3568833Z self = 2025-05-07T20:32:39.3569335Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.3569614Z 2025-05-07T20:32:39.3569701Z @given( 2025-05-07T20:32:39.3569931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3570257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3570585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3570910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3571242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3571528Z ) 2025-05-07T20:32:39.3571876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3572323Z def test_silu_mul_quant( 2025-05-07T20:32:39.3572565Z self, 2025-05-07T20:32:39.3572764Z T: int, 2025-05-07T20:32:39.3572958Z D: int, 2025-05-07T20:32:39.3573185Z scale_ub: Optional[float], 2025-05-07T20:32:39.3573458Z contiguous: bool, 2025-05-07T20:32:39.3573694Z compiled: bool, 2025-05-07T20:32:39.3573921Z ) -> None: 2025-05-07T20:32:39.3574141Z torch.manual_seed(2025) 2025-05-07T20:32:39.3574378Z 2025-05-07T20:32:39.3574656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3575004Z 2025-05-07T20:32:39.3575484Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3575782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3576096Z x = x_sign * x_clamp 2025-05-07T20:32:39.3576333Z x0 = x[:, :D] 2025-05-07T20:32:39.3576554Z x1 = x[:, D:] 2025-05-07T20:32:39.3576765Z 2025-05-07T20:32:39.3576949Z if contiguous: 2025-05-07T20:32:39.3577186Z x0 = x0.contiguous() 2025-05-07T20:32:39.3577488Z x1 = x1.contiguous() 2025-05-07T20:32:39.3577731Z 2025-05-07T20:32:39.3577925Z if scale_ub is not None: 2025-05-07T20:32:39.3578202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3578543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3578851Z ) 2025-05-07T20:32:39.3579052Z else: 2025-05-07T20:32:39.3579265Z scale_ub_tensor = None 2025-05-07T20:32:39.3579600Z 2025-05-07T20:32:39.3579842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3580238Z op = silu_mul_quant 2025-05-07T20:32:39.3580486Z if compiled: 2025-05-07T20:32:39.3580736Z op = torch.compile(op) 2025-05-07T20:32:39.3581036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3581310Z 2025-05-07T20:32:39.3581504Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3581667Z 2025-05-07T20:32:39.3581778Z moe/activation_test.py:117: 2025-05-07T20:32:39.3582082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3582418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3582711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3583408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.3584093Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3584641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3585419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3586091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3586665Z kernel = self.compile( 2025-05-07T20:32:39.3587222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3587889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3588279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3588517Z 2025-05-07T20:32:39.3588724Z self = 2025-05-07T20:32:39.3589806Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3591200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fd6c0>} 2025-05-07T20:32:39.3592549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3593574Z context = 2025-05-07T20:32:39.3593868Z 2025-05-07T20:32:39.3594036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3594561Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3595037Z module_map=module_map) 2025-05-07T20:32:39.3595446Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3595813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3596074Z E ^ 2025-05-07T20:32:39.3596532Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3596992Z 2025-05-07T20:32:39.3597412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3597935Z 2025-05-07T20:32:39.3598038Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3598456Z self=, 2025-05-07T20:32:39.3598852Z T=1, 2025-05-07T20:32:39.3599038Z D=7168, 2025-05-07T20:32:39.3599235Z scale_ub=None, 2025-05-07T20:32:39.3599442Z contiguous=True, 2025-05-07T20:32:39.3599671Z compiled=False, 2025-05-07T20:32:39.3599878Z ) 2025-05-07T20:32:39.3600243Z self = 2025-05-07T20:32:39.3600771Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:39.3601030Z 2025-05-07T20:32:39.3601113Z @given( 2025-05-07T20:32:39.3601336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3601655Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3601964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3602296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3602621Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3602910Z ) 2025-05-07T20:32:39.3603256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3603857Z def test_silu_mul_quant( 2025-05-07T20:32:39.3604100Z self, 2025-05-07T20:32:39.3604297Z T: int, 2025-05-07T20:32:39.3604495Z D: int, 2025-05-07T20:32:39.3604724Z scale_ub: Optional[float], 2025-05-07T20:32:39.3604998Z contiguous: bool, 2025-05-07T20:32:39.3605341Z compiled: bool, 2025-05-07T20:32:39.3605565Z ) -> None: 2025-05-07T20:32:39.3605783Z torch.manual_seed(2025) 2025-05-07T20:32:39.3606023Z 2025-05-07T20:32:39.3606304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3606646Z 2025-05-07T20:32:39.3606845Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3607135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3607444Z x = x_sign * x_clamp 2025-05-07T20:32:39.3607688Z x0 = x[:, :D] 2025-05-07T20:32:39.3607907Z x1 = x[:, D:] 2025-05-07T20:32:39.3608122Z 2025-05-07T20:32:39.3608310Z if contiguous: 2025-05-07T20:32:39.3608539Z x0 = x0.contiguous() 2025-05-07T20:32:39.3608801Z x1 = x1.contiguous() 2025-05-07T20:32:39.3609043Z 2025-05-07T20:32:39.3609233Z if scale_ub is not None: 2025-05-07T20:32:39.3609510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3609859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3610162Z ) 2025-05-07T20:32:39.3610356Z else: 2025-05-07T20:32:39.3610569Z scale_ub_tensor = None 2025-05-07T20:32:39.3610816Z 2025-05-07T20:32:39.3611047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3611362Z op = silu_mul_quant 2025-05-07T20:32:39.3611617Z if compiled: 2025-05-07T20:32:39.3611855Z op = torch.compile(op) 2025-05-07T20:32:39.3612156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3612432Z 2025-05-07T20:32:39.3612618Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3612786Z 2025-05-07T20:32:39.3612885Z moe/activation_test.py:117: 2025-05-07T20:32:39.3613180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3613507Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3613838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3614549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.3615240Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3615772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3616489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3617188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3617712Z kernel = self.compile( 2025-05-07T20:32:39.3618262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3618924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3619374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3619645Z 2025-05-07T20:32:39.3619852Z self = 2025-05-07T20:32:39.3620935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3622301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fcfe0>} 2025-05-07T20:32:39.3623652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3632999Z context = 2025-05-07T20:32:39.3633339Z 2025-05-07T20:32:39.3633533Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3634139Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3634613Z module_map=module_map) 2025-05-07T20:32:39.3634988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3635347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3635604Z E ^ 2025-05-07T20:32:39.3636083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3636532Z 2025-05-07T20:32:39.3636969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3637484Z 2025-05-07T20:32:39.3637601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3638015Z self=, 2025-05-07T20:32:39.3638625Z T=16384, 2025-05-07T20:32:39.3638837Z D=7168, 2025-05-07T20:32:39.3639028Z scale_ub=1200.0, 2025-05-07T20:32:39.3639262Z contiguous=False, 2025-05-07T20:32:39.3639497Z compiled=True, 2025-05-07T20:32:39.7234787Z ) 2025-05-07T20:32:39.7235414Z self = 2025-05-07T20:32:39.7236146Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.7236623Z 2025-05-07T20:32:39.7236860Z @given( 2025-05-07T20:32:39.7237380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7238012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7238948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7239609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7240254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7240846Z ) 2025-05-07T20:32:39.7241932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7242834Z def test_silu_mul_quant( 2025-05-07T20:32:39.7243325Z self, 2025-05-07T20:32:39.7243894Z T: int, 2025-05-07T20:32:39.7244275Z D: int, 2025-05-07T20:32:39.7244708Z scale_ub: Optional[float], 2025-05-07T20:32:39.7245252Z contiguous: bool, 2025-05-07T20:32:39.7245720Z compiled: bool, 2025-05-07T20:32:39.7246167Z ) -> None: 2025-05-07T20:32:39.7246593Z torch.manual_seed(2025) 2025-05-07T20:32:39.7247015Z 2025-05-07T20:32:39.7247345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7247695Z 2025-05-07T20:32:39.7247896Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7248190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7248505Z x = x_sign * x_clamp 2025-05-07T20:32:39.7248749Z x0 = x[:, :D] 2025-05-07T20:32:39.7249065Z x1 = x[:, D:] 2025-05-07T20:32:39.7249355Z 2025-05-07T20:32:39.7249557Z if contiguous: 2025-05-07T20:32:39.7249793Z x0 = x0.contiguous() 2025-05-07T20:32:39.7250067Z x1 = x1.contiguous() 2025-05-07T20:32:39.7250315Z 2025-05-07T20:32:39.7250509Z if scale_ub is not None: 2025-05-07T20:32:39.7250796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7251143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7251450Z ) 2025-05-07T20:32:39.7251653Z else: 2025-05-07T20:32:39.7251874Z scale_ub_tensor = None 2025-05-07T20:32:39.7252132Z 2025-05-07T20:32:39.7252378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7252700Z op = silu_mul_quant 2025-05-07T20:32:39.7252952Z if compiled: 2025-05-07T20:32:39.7253212Z op = torch.compile(op) 2025-05-07T20:32:39.7253524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7253811Z 2025-05-07T20:32:39.7254019Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7254278Z 2025-05-07T20:32:39.7254388Z moe/activation_test.py:117: 2025-05-07T20:32:39.7254680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7255019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7255306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7255865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.7256437Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.7257108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.7257801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7258341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7259034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7259710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7260246Z kernel = self.compile( 2025-05-07T20:32:39.7260794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7261456Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7261859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7262091Z 2025-05-07T20:32:39.7262300Z self = 2025-05-07T20:32:39.7263383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7264817Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875ffb00>} 2025-05-07T20:32:39.7266169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7267190Z context = 2025-05-07T20:32:39.7267505Z 2025-05-07T20:32:39.7267681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7268202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7268671Z module_map=module_map) 2025-05-07T20:32:39.7269038Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7269433Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7269737Z E ^ 2025-05-07T20:32:39.7270209Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7270661Z 2025-05-07T20:32:39.7271088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7271603Z 2025-05-07T20:32:39.7271708Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7272125Z self=, 2025-05-07T20:32:39.7272533Z T=1, 2025-05-07T20:32:39.7272714Z D=7168, 2025-05-07T20:32:39.7272911Z scale_ub=None, 2025-05-07T20:32:39.7273139Z contiguous=False, 2025-05-07T20:32:39.7273360Z compiled=False, 2025-05-07T20:32:39.7273571Z ) 2025-05-07T20:32:39.7273894Z self = 2025-05-07T20:32:39.7274383Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:39.7274662Z 2025-05-07T20:32:39.7274791Z @given( 2025-05-07T20:32:39.7275027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7275342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7275647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7275980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7276312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7276596Z ) 2025-05-07T20:32:39.7276949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7277394Z def test_silu_mul_quant( 2025-05-07T20:32:39.7277633Z self, 2025-05-07T20:32:39.7277834Z T: int, 2025-05-07T20:32:39.7278042Z D: int, 2025-05-07T20:32:39.7278260Z scale_ub: Optional[float], 2025-05-07T20:32:39.7278534Z contiguous: bool, 2025-05-07T20:32:39.7278779Z compiled: bool, 2025-05-07T20:32:39.7279004Z ) -> None: 2025-05-07T20:32:39.7279217Z torch.manual_seed(2025) 2025-05-07T20:32:39.7279463Z 2025-05-07T20:32:39.7279736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7280069Z 2025-05-07T20:32:39.7280269Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7280565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7280867Z x = x_sign * x_clamp 2025-05-07T20:32:39.7281110Z x0 = x[:, :D] 2025-05-07T20:32:39.7281329Z x1 = x[:, D:] 2025-05-07T20:32:39.7281535Z 2025-05-07T20:32:39.7281724Z if contiguous: 2025-05-07T20:32:39.7281957Z x0 = x0.contiguous() 2025-05-07T20:32:39.7282208Z x1 = x1.contiguous() 2025-05-07T20:32:39.7282450Z 2025-05-07T20:32:39.7282645Z if scale_ub is not None: 2025-05-07T20:32:39.7282917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7283256Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7283670Z ) 2025-05-07T20:32:39.7283922Z else: 2025-05-07T20:32:39.7284131Z scale_ub_tensor = None 2025-05-07T20:32:39.7284387Z 2025-05-07T20:32:39.7284622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7284932Z op = silu_mul_quant 2025-05-07T20:32:39.7285181Z if compiled: 2025-05-07T20:32:39.7285429Z op = torch.compile(op) 2025-05-07T20:32:39.7285721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7285998Z 2025-05-07T20:32:39.7286193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7286357Z 2025-05-07T20:32:39.7286460Z moe/activation_test.py:117: 2025-05-07T20:32:39.7286757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7287093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7287377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7288114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.7288854Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7289399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7290078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7290752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7291292Z kernel = self.compile( 2025-05-07T20:32:39.7291844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7292500Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7292900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7293129Z 2025-05-07T20:32:39.7293350Z self = 2025-05-07T20:32:39.7294473Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7295834Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a83749a0>} 2025-05-07T20:32:39.7297177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7298212Z context = 2025-05-07T20:32:39.7298499Z 2025-05-07T20:32:39.7298676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7299197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7299672Z module_map=module_map) 2025-05-07T20:32:39.7300037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7300395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7300652Z E ^ 2025-05-07T20:32:39.7301116Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7301568Z 2025-05-07T20:32:39.7301992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7302506Z 2025-05-07T20:32:39.7302612Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7303025Z self=, 2025-05-07T20:32:39.7303450Z T=2048, 2025-05-07T20:32:39.7303642Z D=7168, 2025-05-07T20:32:39.7303837Z scale_ub=None, 2025-05-07T20:32:39.7304105Z contiguous=False, 2025-05-07T20:32:39.7304337Z compiled=True, 2025-05-07T20:32:39.7304554Z ) 2025-05-07T20:32:39.7995394Z self = 2025-05-07T20:32:39.7996193Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.7996594Z 2025-05-07T20:32:39.7996722Z @given( 2025-05-07T20:32:39.7996958Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7997281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7997596Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7997927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7998255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7998550Z ) 2025-05-07T20:32:39.7998906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7999685Z def test_silu_mul_quant( 2025-05-07T20:32:39.8000039Z self, 2025-05-07T20:32:39.8000251Z T: int, 2025-05-07T20:32:39.8000446Z D: int, 2025-05-07T20:32:39.8000669Z scale_ub: Optional[float], 2025-05-07T20:32:39.8000948Z contiguous: bool, 2025-05-07T20:32:39.8001187Z compiled: bool, 2025-05-07T20:32:39.8001424Z ) -> None: 2025-05-07T20:32:39.8001641Z torch.manual_seed(2025) 2025-05-07T20:32:39.8001881Z 2025-05-07T20:32:39.8002159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8002502Z 2025-05-07T20:32:39.8002709Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8003000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8003315Z x = x_sign * x_clamp 2025-05-07T20:32:39.8003690Z x0 = x[:, :D] 2025-05-07T20:32:39.8003910Z x1 = x[:, D:] 2025-05-07T20:32:39.8004126Z 2025-05-07T20:32:39.8004318Z if contiguous: 2025-05-07T20:32:39.8004554Z x0 = x0.contiguous() 2025-05-07T20:32:39.8004823Z x1 = x1.contiguous() 2025-05-07T20:32:39.8005160Z 2025-05-07T20:32:39.8005353Z if scale_ub is not None: 2025-05-07T20:32:39.8005634Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8005982Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8006286Z ) 2025-05-07T20:32:39.8006498Z else: 2025-05-07T20:32:39.8006752Z scale_ub_tensor = None 2025-05-07T20:32:39.8007008Z 2025-05-07T20:32:39.8007245Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8007565Z op = silu_mul_quant 2025-05-07T20:32:39.8007830Z if compiled: 2025-05-07T20:32:39.8008078Z op = torch.compile(op) 2025-05-07T20:32:39.8008380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8008672Z 2025-05-07T20:32:39.8008864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.8009039Z 2025-05-07T20:32:39.8009141Z moe/activation_test.py:117: 2025-05-07T20:32:39.8009456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8009789Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.8010078Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8010652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.8011224Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.8011896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.8012592Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.8013147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8013833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8014606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8015160Z kernel = self.compile( 2025-05-07T20:32:39.8015718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8016382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8016789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8017021Z 2025-05-07T20:32:39.8017241Z self = 2025-05-07T20:32:39.8018348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8019791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8375d00>} 2025-05-07T20:32:39.8021181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8022213Z context = 2025-05-07T20:32:39.8022500Z 2025-05-07T20:32:39.8022680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8023199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8023670Z module_map=module_map) 2025-05-07T20:32:39.8024050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8024426Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.8024692Z E ^ 2025-05-07T20:32:39.8025196Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.8025699Z 2025-05-07T20:32:39.8026129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8026648Z 2025-05-07T20:32:39.8026755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.8027178Z self=, 2025-05-07T20:32:39.8027584Z T=4096, 2025-05-07T20:32:39.8027778Z D=7168, 2025-05-07T20:32:39.8027969Z scale_ub=None, 2025-05-07T20:32:39.8028190Z contiguous=False, 2025-05-07T20:32:39.8028420Z compiled=True, 2025-05-07T20:32:39.8028624Z ) 2025-05-07T20:32:39.8028954Z self = 2025-05-07T20:32:39.8029455Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.8029728Z 2025-05-07T20:32:39.8029810Z @given( 2025-05-07T20:32:39.8030050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.8030374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.8030678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.8031022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.8031358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.8031652Z ) 2025-05-07T20:32:39.8031999Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.8032452Z def test_silu_mul_quant( 2025-05-07T20:32:39.8032704Z self, 2025-05-07T20:32:39.8032896Z T: int, 2025-05-07T20:32:39.8033098Z D: int, 2025-05-07T20:32:39.8033324Z scale_ub: Optional[float], 2025-05-07T20:32:39.8033593Z contiguous: bool, 2025-05-07T20:32:39.8033846Z compiled: bool, 2025-05-07T20:32:39.8034076Z ) -> None: 2025-05-07T20:32:39.8034288Z torch.manual_seed(2025) 2025-05-07T20:32:39.8034545Z 2025-05-07T20:32:39.8034877Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8035229Z 2025-05-07T20:32:39.8035428Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8035730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8036047Z x = x_sign * x_clamp 2025-05-07T20:32:39.8036285Z x0 = x[:, :D] 2025-05-07T20:32:39.8036505Z x1 = x[:, D:] 2025-05-07T20:32:39.8036740Z 2025-05-07T20:32:39.8036947Z if contiguous: 2025-05-07T20:32:39.8037182Z x0 = x0.contiguous() 2025-05-07T20:32:39.8037447Z x1 = x1.contiguous() 2025-05-07T20:32:39.8037686Z 2025-05-07T20:32:39.8037881Z if scale_ub is not None: 2025-05-07T20:32:39.8038159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8038864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8039184Z ) 2025-05-07T20:32:39.8039383Z else: 2025-05-07T20:32:39.8039666Z scale_ub_tensor = None 2025-05-07T20:32:39.8039981Z 2025-05-07T20:32:39.8040225Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8040540Z op = silu_mul_quant 2025-05-07T20:32:39.8040798Z if compiled: 2025-05-07T20:32:39.8041050Z op = torch.compile(op) 2025-05-07T20:32:39.8041344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8041623Z 2025-05-07T20:32:39.8041818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.8041983Z 2025-05-07T20:32:39.8042094Z moe/activation_test.py:117: 2025-05-07T20:32:39.8042387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8042722Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.8043014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8043708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.8044285Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.8044958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.8045741Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.8046282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8046978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8047656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8048198Z kernel = self.compile( 2025-05-07T20:32:39.8048753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8049423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8049829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8050064Z 2025-05-07T20:32:39.8050286Z self = 2025-05-07T20:32:39.8051377Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8052750Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8376840>} 2025-05-07T20:32:39.8054102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8055138Z context = 2025-05-07T20:32:39.8055426Z 2025-05-07T20:32:39.8055666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8056204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8056678Z module_map=module_map) 2025-05-07T20:32:39.8057043Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8057455Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.8057724Z E ^ 2025-05-07T20:32:39.8058193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.8058644Z 2025-05-07T20:32:39.8059065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8059587Z 2025-05-07T20:32:39.9335565Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9336255Z self=, 2025-05-07T20:32:39.9337128Z T=16384, 2025-05-07T20:32:39.9337488Z D=5120, 2025-05-07T20:32:39.9337764Z scale_ub=1200.0, 2025-05-07T20:32:39.9338050Z contiguous=False, 2025-05-07T20:32:39.9338339Z compiled=False, 2025-05-07T20:32:39.9338830Z ) 2025-05-07T20:32:39.9339158Z self = 2025-05-07T20:32:39.9339662Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.9339942Z 2025-05-07T20:32:39.9340020Z @given( 2025-05-07T20:32:39.9340249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9340562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9340864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9341196Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9341521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9341806Z ) 2025-05-07T20:32:39.9342150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9342598Z def test_silu_mul_quant( 2025-05-07T20:32:39.9342968Z self, 2025-05-07T20:32:39.9343157Z T: int, 2025-05-07T20:32:39.9343357Z D: int, 2025-05-07T20:32:39.9343575Z scale_ub: Optional[float], 2025-05-07T20:32:39.9343842Z contiguous: bool, 2025-05-07T20:32:39.9344106Z compiled: bool, 2025-05-07T20:32:39.9344341Z ) -> None: 2025-05-07T20:32:39.9344555Z torch.manual_seed(2025) 2025-05-07T20:32:39.9353807Z 2025-05-07T20:32:39.9354091Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9354429Z 2025-05-07T20:32:39.9354640Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9354936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9355254Z x = x_sign * x_clamp 2025-05-07T20:32:39.9355502Z x0 = x[:, :D] 2025-05-07T20:32:39.9355719Z x1 = x[:, D:] 2025-05-07T20:32:39.9355941Z 2025-05-07T20:32:39.9356139Z if contiguous: 2025-05-07T20:32:39.9356384Z x0 = x0.contiguous() 2025-05-07T20:32:39.9356655Z x1 = x1.contiguous() 2025-05-07T20:32:39.9356904Z 2025-05-07T20:32:39.9357097Z if scale_ub is not None: 2025-05-07T20:32:39.9357383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9357727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9358032Z ) 2025-05-07T20:32:39.9358239Z else: 2025-05-07T20:32:39.9358455Z scale_ub_tensor = None 2025-05-07T20:32:39.9358699Z 2025-05-07T20:32:39.9358941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9359263Z op = silu_mul_quant 2025-05-07T20:32:39.9359519Z if compiled: 2025-05-07T20:32:39.9359769Z op = torch.compile(op) 2025-05-07T20:32:39.9360069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9360352Z 2025-05-07T20:32:39.9360547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9360724Z 2025-05-07T20:32:39.9360958Z moe/activation_test.py:117: 2025-05-07T20:32:39.9361261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9361597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9361881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9362579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.9363274Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9363975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9364665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9365338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9365945Z kernel = self.compile( 2025-05-07T20:32:39.9366500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9367272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9367685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9367914Z 2025-05-07T20:32:39.9368123Z self = 2025-05-07T20:32:39.9369203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9370580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f14040>} 2025-05-07T20:32:39.9371927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9372998Z context = 2025-05-07T20:32:39.9373286Z 2025-05-07T20:32:39.9373455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9373982Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9374458Z module_map=module_map) 2025-05-07T20:32:39.9374822Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9375182Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9375450Z E ^ 2025-05-07T20:32:39.9375924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9376375Z 2025-05-07T20:32:39.9376799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9377325Z 2025-05-07T20:32:39.9377428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9377871Z self=, 2025-05-07T20:32:39.9378281Z T=16384, 2025-05-07T20:32:39.9378474Z D=5120, 2025-05-07T20:32:39.9378671Z scale_ub=1200.0, 2025-05-07T20:32:39.9378899Z contiguous=True, 2025-05-07T20:32:39.9379117Z compiled=True, 2025-05-07T20:32:39.9379327Z ) 2025-05-07T20:32:39.9379657Z self = 2025-05-07T20:32:39.9380164Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:39.9380439Z 2025-05-07T20:32:39.9380516Z @given( 2025-05-07T20:32:39.9380752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9381069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9381422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9381760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9382087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9382370Z ) 2025-05-07T20:32:39.9382722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9383173Z def test_silu_mul_quant( 2025-05-07T20:32:39.9383420Z self, 2025-05-07T20:32:39.9383613Z T: int, 2025-05-07T20:32:39.9383819Z D: int, 2025-05-07T20:32:39.9384046Z scale_ub: Optional[float], 2025-05-07T20:32:39.9384311Z contiguous: bool, 2025-05-07T20:32:39.9384553Z compiled: bool, 2025-05-07T20:32:39.9384772Z ) -> None: 2025-05-07T20:32:39.9384986Z torch.manual_seed(2025) 2025-05-07T20:32:39.9385231Z 2025-05-07T20:32:39.9385498Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9385883Z 2025-05-07T20:32:39.9386078Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9386408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9386718Z x = x_sign * x_clamp 2025-05-07T20:32:39.9386962Z x0 = x[:, :D] 2025-05-07T20:32:39.9387176Z x1 = x[:, D:] 2025-05-07T20:32:39.9387387Z 2025-05-07T20:32:39.9387578Z if contiguous: 2025-05-07T20:32:39.9387811Z x0 = x0.contiguous() 2025-05-07T20:32:39.9388063Z x1 = x1.contiguous() 2025-05-07T20:32:39.9388302Z 2025-05-07T20:32:39.9388494Z if scale_ub is not None: 2025-05-07T20:32:39.9388761Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9389094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9389402Z ) 2025-05-07T20:32:39.9389588Z else: 2025-05-07T20:32:39.9389797Z scale_ub_tensor = None 2025-05-07T20:32:39.9390049Z 2025-05-07T20:32:39.9390277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9390597Z op = silu_mul_quant 2025-05-07T20:32:39.9390915Z if compiled: 2025-05-07T20:32:39.9391157Z op = torch.compile(op) 2025-05-07T20:32:39.9391455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9391733Z 2025-05-07T20:32:39.9391919Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9392089Z 2025-05-07T20:32:39.9392187Z moe/activation_test.py:117: 2025-05-07T20:32:39.9392482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9392817Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9393093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9393656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.9394220Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.9394876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.9395576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9396118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9396857Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9397520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9398054Z kernel = self.compile( 2025-05-07T20:32:39.9398600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9399262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9399652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9399886Z 2025-05-07T20:32:39.9400096Z self = 2025-05-07T20:32:39.9401217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9402588Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f15300>} 2025-05-07T20:32:39.9404010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9405038Z context = 2025-05-07T20:32:39.9405329Z 2025-05-07T20:32:39.9405495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9406068Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9406569Z module_map=module_map) 2025-05-07T20:32:39.9406937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9407291Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9407546Z E ^ 2025-05-07T20:32:39.9408013Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9408470Z 2025-05-07T20:32:39.9408887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9409401Z 2025-05-07T20:32:40.2640687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.2641953Z self=, 2025-05-07T20:32:40.2643065Z T=16384, 2025-05-07T20:32:40.2643469Z D=5120, 2025-05-07T20:32:40.2644046Z scale_ub=None, 2025-05-07T20:32:40.2644509Z contiguous=False, 2025-05-07T20:32:40.2644980Z compiled=True, 2025-05-07T20:32:40.2645774Z ) 2025-05-07T20:32:40.2646418Z self = 2025-05-07T20:32:40.2647088Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.2647370Z 2025-05-07T20:32:40.2647456Z @given( 2025-05-07T20:32:40.2647686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.2648001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.2648310Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.2648636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.2648966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.2649254Z ) 2025-05-07T20:32:40.2649601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.2650053Z def test_silu_mul_quant( 2025-05-07T20:32:40.2650308Z self, 2025-05-07T20:32:40.2650502Z T: int, 2025-05-07T20:32:40.2650712Z D: int, 2025-05-07T20:32:40.2650935Z scale_ub: Optional[float], 2025-05-07T20:32:40.2651210Z contiguous: bool, 2025-05-07T20:32:40.2651447Z compiled: bool, 2025-05-07T20:32:40.2651681Z ) -> None: 2025-05-07T20:32:40.2651908Z torch.manual_seed(2025) 2025-05-07T20:32:40.2652145Z 2025-05-07T20:32:40.2652420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.2652764Z 2025-05-07T20:32:40.2652956Z x_sign = torch.sign(x) 2025-05-07T20:32:40.2653252Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.2653561Z x = x_sign * x_clamp 2025-05-07T20:32:40.2653796Z x0 = x[:, :D] 2025-05-07T20:32:40.2654014Z x1 = x[:, D:] 2025-05-07T20:32:40.2654230Z 2025-05-07T20:32:40.2654416Z if contiguous: 2025-05-07T20:32:40.2654651Z x0 = x0.contiguous() 2025-05-07T20:32:40.2654920Z x1 = x1.contiguous() 2025-05-07T20:32:40.2655249Z 2025-05-07T20:32:40.2655455Z if scale_ub is not None: 2025-05-07T20:32:40.2655754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.2656094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.2656397Z ) 2025-05-07T20:32:40.2656598Z else: 2025-05-07T20:32:40.2656811Z scale_ub_tensor = None 2025-05-07T20:32:40.2657059Z 2025-05-07T20:32:40.2657301Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2657620Z op = silu_mul_quant 2025-05-07T20:32:40.2657868Z if compiled: 2025-05-07T20:32:40.2658119Z op = torch.compile(op) 2025-05-07T20:32:40.2658418Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2658689Z 2025-05-07T20:32:40.2658888Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.2659053Z 2025-05-07T20:32:40.2659160Z moe/activation_test.py:117: 2025-05-07T20:32:40.2659537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2659982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.2660267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2660832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.2661391Z return fn(*args, **kwargs) 
[hypothesis went on to try ten more examples; each one failed inside _fbgemm_silu_mul_quant with the identical CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the repeated test source and tracebacks are elided]
2025-05-07T20:32:40.2676650Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:40.3441823Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:40.4821283Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:40.4875272Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:40.7631916Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:40.8628812Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:40.8681359Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:41.0080475Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.0134524Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.0909919Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
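Every CompilationError above is the same architecture mismatch: fp8e4nv is Triton's name for FP8 E4M3, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward, while the A10G in this linux.g5.4xlarge.nvidia.gpu runner reports capability 8.6 and therefore only offers 'fp8e4b15' and 'fp8e5'. A guard along the following lines, shown here as a hypothetical addition (activation_test.py contains no such guard), would skip the test on pre-sm_89 runners instead of failing it:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs compute capability >= (8, 9); the A10G on
        # a g5.4xlarge reports (8, 6), so this returns False there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...

The other obvious fix is to route this job to an sm_89-or-newer runner pool.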
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.0955073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.1605308Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... test source identical to the first listing above ...]
2025-05-07T20:32:41.1620239Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.1623948Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1627626Z moe/activation_test.py:95: OutOfMemoryError
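The free-memory figure in these OOM reports shrinks as the run proceeds (140.44 MiB here, then 28.44 MiB and 26.44 MiB below), which suggests allocations from failed examples survive into later ones. A minimal cleanup sketch between Hypothesis examples, assuming that accumulation is the cause; release_cuda_memory is a hypothetical helper, not part of the test file:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to CUDA tensors
        torch.cuda.synchronize()  # let in-flight kernels finish first
        torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

Calling it at the top of the test (or from a per-example hook) would keep one example's tensors from starving the next.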
2025-05-07T20:32:41.1628162Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.1655119Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.1658838Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1662717Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:41.1663267Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.1676816Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.1680691Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1684582Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.1685135Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.1699068Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.1702289Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1705700Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:41.1706231Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.2591006Z > x_sign = torch.sign(x)
2025-05-07T20:32:41.2592982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.2595092Z moe/activation_test.py:94: OutOfMemoryError
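Every report above ends with the same allocator hint. A sketch of applying it, assuming the variable can be set before CUDA initializes in the test process; in a job like this one it would more naturally be exported in the workflow environment:

    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the caching allocator sees it

Expandable segments reduce fragmentation of the reserved-but-unallocated pool; they do not add capacity, so the accumulation noted above would still need fixing.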
2025-05-07T20:32:41.2595418Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.2610185Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.2610458Z moe/activation_test.py:117:
[... same Triton traceback through silu_mul_quant -> jit.py -> compiler.py as in the first failure above ...]
2025-05-07T20:32:41.2624193Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.2624558Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.2624831Z E ^
2025-05-07T20:32:41.2625299Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.2626186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
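Note this example has compiled=False: the CompilationError comes from Triton's own JIT when _fbgemm_silu_mul_quant is built, so every path that reaches the kernel fails on this GPU. The A10G on a linux.g5.4xlarge runner is compute capability 8.6, and Triton's fp8e4nv (e4m3) dtype generally requires 8.9 or newer, which matches the error's list of remaining fp8 dtypes. A minimal skip-guard sketch under that assumption; supports_fp8e4nv is a hypothetical helper, not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # sm_89 (Ada) and newer expose fp8e4nv; the A10G here reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

Applied as, e.g., @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+") on the test, it would turn these hard failures into skips on pre-Ada runners.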
2025-05-07T20:32:41.2626818Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5085111Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.5085483Z moe/activation_test.py:117:
[... same Triton traceback as above ...]
2025-05-07T20:32:41.5099189Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.5099551Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.5099816Z E ^
2025-05-07T20:32:41.5100290Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.5101174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.5101810Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5116344Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.5116607Z moe/activation_test.py:117:
[... same Triton traceback as above ...]
2025-05-07T20:32:41.5130249Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.5130611Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.5130873Z E ^
2025-05-07T20:32:41.5131353Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.5144598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
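For debugging outside Hypothesis, a standalone repro sketch of the compile failure, assuming the import path shown in the traceback and the call shape used by the test; the smallest failing example (T=1, D=7168) keeps memory pressure out of the picture:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # Building _fbgemm_silu_mul_quant raises the fp8e4nv CompilationError on
    # sm_86, with or without torch.compile around the op.
    y_fp8, y_scale = silu_mul_quant(x[:, :D].contiguous(), x[:, D:].contiguous(), None)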
2025-05-07T20:32:41.5145237Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5827152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.5829248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.5831257Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.5831695Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5846815Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.5847082Z moe/activation_test.py:117:
[... same Triton traceback as above ...]
2025-05-07T20:32:41.5860776Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.5861138Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.5861396Z E ^
2025-05-07T20:32:41.5861870Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.5862800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
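All of this per-example output comes from @settings(verbosity=Verbosity.verbose, ...) in the test. A hedged sketch of a quieter profile for CI runs; the "ci" profile name is illustrative, and max_examples is left to the existing _MAX_SAMPLES:

    from hypothesis import Verbosity, settings

    settings.register_profile("ci", verbosity=Verbosity.normal, deadline=None)
    settings.load_profile("ci")  # e.g. from conftest.py when running under CI

With normal verbosity Hypothesis still reports the final falsifying example, so nothing diagnostic is lost.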
2025-05-07T20:32:41.5863427Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5872053Z > x_sign = torch.sign(x)
2025-05-07T20:32:41.5874005Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.5875982Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:41.5876346Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6584606Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6586647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6588682Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6589015Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6597173Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6599210Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6601179Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6601550Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6609636Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6611721Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6613787Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6614112Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.6622146Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6624191Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6626171Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6626491Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6634626Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6636722Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6638935Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6639254Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7574518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7576843Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7578919Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7579246Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.7587304Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7589351Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7591421Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7591741Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7609069Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7611183Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7613184Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7613518Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7621571Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7623635Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7625622Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7625949Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7634076Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7636138Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.7638109Z 2025-05-07T20:32:41.7638226Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.7638723Z 2025-05-07T20:32:41.7638829Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.7639244Z self=, 2025-05-07T20:32:41.7639641Z T=128, 2025-05-07T20:32:41.7639831Z D=5120, 2025-05-07T20:32:41.7640024Z scale_ub=1200.0, 2025-05-07T20:32:41.7640245Z contiguous=False, 2025-05-07T20:32:41.7640475Z compiled=False, 2025-05-07T20:32:41.7640678Z ) 2025-05-07T20:32:41.8660362Z self = 2025-05-07T20:32:41.8661085Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.8661398Z 2025-05-07T20:32:41.8661486Z @given( 2025-05-07T20:32:41.8661727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.8662046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.8662354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.8662678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.8663006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.8663296Z ) 2025-05-07T20:32:41.8663643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.8664091Z def test_silu_mul_quant( 2025-05-07T20:32:41.8664341Z self, 2025-05-07T20:32:41.8664533Z T: int, 2025-05-07T20:32:41.8664734Z D: int, 2025-05-07T20:32:41.8664958Z scale_ub: Optional[float], 2025-05-07T20:32:41.8665230Z contiguous: bool, 2025-05-07T20:32:41.8665476Z compiled: bool, 2025-05-07T20:32:41.8665710Z ) -> None: 2025-05-07T20:32:41.8666254Z torch.manual_seed(2025) 2025-05-07T20:32:41.8666508Z 2025-05-07T20:32:41.8666872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.8667212Z 2025-05-07T20:32:41.8667414Z x_sign = torch.sign(x) 2025-05-07T20:32:41.8667710Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.8668026Z x = x_sign * x_clamp 2025-05-07T20:32:41.8668262Z x0 = x[:, :D] 2025-05-07T20:32:41.8668485Z x1 = x[:, D:] 2025-05-07T20:32:41.8668697Z 2025-05-07T20:32:41.8668884Z if contiguous: 2025-05-07T20:32:41.8669122Z x0 = x0.contiguous() 2025-05-07T20:32:41.8669383Z x1 = x1.contiguous() 2025-05-07T20:32:41.8669619Z 2025-05-07T20:32:41.8669815Z if scale_ub is not None: 2025-05-07T20:32:41.8670088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.8670503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.8670819Z ) 2025-05-07T20:32:41.8671020Z else: 2025-05-07T20:32:41.8671229Z scale_ub_tensor = None 2025-05-07T20:32:41.8671487Z 2025-05-07T20:32:41.8671724Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.8672033Z op = silu_mul_quant 2025-05-07T20:32:41.8672287Z if compiled: 2025-05-07T20:32:41.8672540Z op = torch.compile(op) 2025-05-07T20:32:41.8672836Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.8673104Z 2025-05-07T20:32:41.8673304Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.8673468Z 2025-05-07T20:32:41.8673580Z moe/activation_test.py:117: 2025-05-07T20:32:41.8673870Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.8674207Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.8674496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.8675189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.8675977Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.8676526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.8677224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.8677939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.8678475Z kernel = self.compile( 2025-05-07T20:32:41.8679023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.8679689Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.8680083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.8680323Z 2025-05-07T20:32:41.8680535Z self = 2025-05-07T20:32:41.8681625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.8683024Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb286dd2700>} 2025-05-07T20:32:41.8684475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.8685513Z context = 2025-05-07T20:32:41.8685809Z 2025-05-07T20:32:41.8685980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.8686562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.8687073Z module_map=module_map) 2025-05-07T20:32:41.8687443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.8687802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.8688063Z E ^ 2025-05-07T20:32:41.8688533Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.8688994Z 2025-05-07T20:32:41.8689419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.8689936Z 2025-05-07T20:32:41.8690046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8690458Z self=, 2025-05-07T20:32:41.8690864Z T=2048, 2025-05-07T20:32:41.8691056Z D=7168, 2025-05-07T20:32:41.8691320Z scale_ub=None, 2025-05-07T20:32:41.8691541Z contiguous=False, 2025-05-07T20:32:41.8691771Z compiled=False, 2025-05-07T20:32:41.8691976Z ) 2025-05-07T20:32:41.8692295Z self = 2025-05-07T20:32:41.8692812Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.8693084Z 2025-05-07T20:32:41.8693163Z @given( 2025-05-07T20:32:41.8693386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.8693722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.8694033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.8694354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.8694688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.8694980Z ) 2025-05-07T20:32:41.8695329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.8695783Z def test_silu_mul_quant( 2025-05-07T20:32:41.8696023Z self, 2025-05-07T20:32:41.8696219Z T: int, 2025-05-07T20:32:41.8696469Z D: int, 2025-05-07T20:32:41.8696692Z scale_ub: Optional[float], 2025-05-07T20:32:41.8696971Z contiguous: bool, 2025-05-07T20:32:41.8697205Z compiled: bool, 2025-05-07T20:32:41.8697432Z ) -> None: 2025-05-07T20:32:41.8697651Z torch.manual_seed(2025) 2025-05-07T20:32:41.8697889Z 2025-05-07T20:32:41.8698165Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.8700231Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
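Every OOM message here ends with the allocator's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Given that roughly 21.7 GiB of the 22.07 GiB card is already allocated by PyTorch and only a few MiB are reserved-but-unallocated, fragmentation is unlikely to be the whole story, but the setting is cheap to try. A minimal sketch; the variable must be set before the first CUDA allocation:

    # Sketch: opt in to expandable segments, as suggested by the error text.
    # Set the variable before importing torch so the CUDA caching allocator
    # reads it when it initializes.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    _ = torch.zeros(1, device="cuda")  # first allocation picks up the setting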
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.8702088Z 2025-05-07T20:32:41.8702214Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.8702425Z 2025-05-07T20:32:41.8702528Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8702938Z self=, 2025-05-07T20:32:41.8703342Z T=128, 2025-05-07T20:32:41.8703531Z D=7168, 2025-05-07T20:32:41.8703721Z scale_ub=1200.0, 2025-05-07T20:32:41.8703945Z contiguous=True, 2025-05-07T20:32:41.8704169Z compiled=True, 2025-05-07T20:32:41.8704368Z ) 2025-05-07T20:32:41.9013601Z self = 2025-05-07T20:32:41.9014114Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.9014503Z 2025-05-07T20:32:41.9014620Z @given( 2025-05-07T20:32:41.9014949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9015475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9015865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9016211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9016555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9016849Z ) 2025-05-07T20:32:41.9017213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9017673Z def test_silu_mul_quant( 2025-05-07T20:32:41.9017923Z self, 2025-05-07T20:32:41.9018136Z T: int, 2025-05-07T20:32:41.9018353Z D: int, 2025-05-07T20:32:41.9018583Z scale_ub: Optional[float], 2025-05-07T20:32:41.9018874Z contiguous: bool, 2025-05-07T20:32:41.9019133Z compiled: bool, 2025-05-07T20:32:41.9019367Z ) -> None: 2025-05-07T20:32:41.9019602Z torch.manual_seed(2025) 2025-05-07T20:32:41.9019861Z 2025-05-07T20:32:41.9020441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9020810Z 2025-05-07T20:32:41.9021024Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9021335Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9021650Z x = x_sign * x_clamp 2025-05-07T20:32:41.9021909Z x0 = x[:, :D] 2025-05-07T20:32:41.9022144Z x1 = x[:, D:] 2025-05-07T20:32:41.9022361Z 2025-05-07T20:32:41.9022564Z if contiguous: 2025-05-07T20:32:41.9022815Z x0 = x0.contiguous() 2025-05-07T20:32:41.9023081Z x1 = x1.contiguous() 2025-05-07T20:32:41.9023342Z 2025-05-07T20:32:41.9023549Z if scale_ub is not None: 2025-05-07T20:32:41.9023830Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.9024185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.9024509Z ) 2025-05-07T20:32:41.9024711Z else: 2025-05-07T20:32:41.9024942Z scale_ub_tensor = None 2025-05-07T20:32:41.9025210Z 2025-05-07T20:32:41.9025457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.9025881Z op = silu_mul_quant 2025-05-07T20:32:41.9026146Z if compiled: 2025-05-07T20:32:41.9026408Z op = torch.compile(op) 2025-05-07T20:32:41.9026714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9027005Z 2025-05-07T20:32:41.9027216Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.9027411Z 2025-05-07T20:32:41.9027541Z moe/activation_test.py:117: 2025-05-07T20:32:41.9027851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9028196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.9028486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9029062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.9029642Z return fn(*args, **kwargs) 
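The only difference between this compiled=True trace and the eager ones is the torch/_dynamo/eval_frame.py frame above; the frames that follow are identical, because both paths end up JIT-compiling the same _fbgemm_silu_mul_quant Triton kernel. A self-contained sketch of that dispatch (assumes a CUDA device and an fbgemm_gpu gen_ai build, as on this runner; the import path is inferred from the site-packages paths in the trace):

    # Sketch: the test's dispatch pattern. torch.compile wraps the op, but the
    # same Triton kernel is compiled either way, so both paths fail alike on a
    # GPU without fp8e4nv support.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)

    for compiled in (False, True):
        op = torch.compile(silu_mul_quant) if compiled else silu_mul_quant
        y_fp8, y_scale = op(x0, x1, None)  # scale_ub_tensor=None, as in the test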
2025-05-07T20:32:41.9030327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.9031031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.9031582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.9032283Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.9032957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.9033510Z kernel = self.compile( 2025-05-07T20:32:41.9034067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.9034745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9035153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9035396Z 2025-05-07T20:32:41.9035662Z self = 2025-05-07T20:32:41.9036801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.9038195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb286dd3f60>} 2025-05-07T20:32:41.9039790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.9040832Z context = 2025-05-07T20:32:41.9041132Z 2025-05-07T20:32:41.9041377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.9041922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9042404Z module_map=module_map) 2025-05-07T20:32:41.9042789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9043167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9043441Z E ^ 2025-05-07T20:32:41.9044000Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.9044464Z 2025-05-07T20:32:41.9044891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.9045415Z 2025-05-07T20:32:41.9045530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9045962Z self=, 2025-05-07T20:32:41.9046380Z T=128, 2025-05-07T20:32:41.9046585Z D=7168, 2025-05-07T20:32:41.9046800Z scale_ub=1200.0, 2025-05-07T20:32:41.9047102Z contiguous=True, 2025-05-07T20:32:41.9047341Z compiled=False, 2025-05-07T20:32:41.9047562Z ) 2025-05-07T20:32:41.9047888Z self = 2025-05-07T20:32:41.9048399Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.9048674Z 2025-05-07T20:32:41.9048765Z @given( 2025-05-07T20:32:41.9048999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9049323Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9049640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9049977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9050308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9050611Z ) 2025-05-07T20:32:41.9050975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9051421Z def test_silu_mul_quant( 2025-05-07T20:32:41.9051677Z self, 2025-05-07T20:32:41.9051891Z T: int, 2025-05-07T20:32:41.9052095Z D: int, 2025-05-07T20:32:41.9052329Z scale_ub: Optional[float], 2025-05-07T20:32:41.9052614Z contiguous: bool, 2025-05-07T20:32:41.9052862Z compiled: bool, 2025-05-07T20:32:41.9053099Z ) -> None: 2025-05-07T20:32:41.9053328Z torch.manual_seed(2025) 2025-05-07T20:32:41.9053575Z 2025-05-07T20:32:41.9053862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9054218Z 2025-05-07T20:32:41.9054422Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9054728Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9056820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
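Note how the OOM has drifted: earlier examples failed at activation_test.py:92 (the initial randn) with ~26 MiB free, while this one survives the allocation and dies at line 95 (the clamp) with only 4.44 MiB free, which suggests tensors from previous Hypothesis examples are still resident on the device. A hedged mitigation sketch using standard torch/gc APIs (not the test's actual code):

    # Sketch: release CUDA memory between Hypothesis examples, e.g. from the
    # test class's tearDown() or at the end of the test body.
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.synchronize()  # make sure pending kernels have finished
        torch.cuda.empty_cache()  # return cached blocks to the driver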
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9058745Z 2025-05-07T20:32:41.9058871Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:41.9059089Z 2025-05-07T20:32:41.9059206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9059625Z self=, 2025-05-07T20:32:41.9060044Z T=128, 2025-05-07T20:32:41.9060247Z D=5120, 2025-05-07T20:32:41.9060447Z scale_ub=1200.0, 2025-05-07T20:32:41.9060687Z contiguous=True, 2025-05-07T20:32:41.9060925Z compiled=True, 2025-05-07T20:32:41.9061139Z ) 2025-05-07T20:32:41.9061527Z self = 2025-05-07T20:32:41.9062036Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.9062312Z 2025-05-07T20:32:41.9062403Z @given( 2025-05-07T20:32:41.9062639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9062965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9063284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9063617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9063959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9064259Z ) 2025-05-07T20:32:41.9064614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9065263Z def test_silu_mul_quant( 2025-05-07T20:32:41.9065524Z self, 2025-05-07T20:32:41.9065732Z T: int, 2025-05-07T20:32:41.9065936Z D: int, 2025-05-07T20:32:41.9066170Z scale_ub: Optional[float], 2025-05-07T20:32:41.9066455Z contiguous: bool, 2025-05-07T20:32:41.9066703Z compiled: bool, 2025-05-07T20:32:41.9066992Z ) -> None: 2025-05-07T20:32:41.9067220Z torch.manual_seed(2025) 2025-05-07T20:32:41.9067467Z 2025-05-07T20:32:41.9067753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9068104Z 2025-05-07T20:32:41.9068302Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9068606Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9070601Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9072452Z 2025-05-07T20:32:41.9072578Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:41.9072796Z 2025-05-07T20:32:41.9072911Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9073327Z self=, 2025-05-07T20:32:41.9073741Z T=128, 2025-05-07T20:32:41.9073941Z D=7168, 2025-05-07T20:32:41.9074142Z scale_ub=None, 2025-05-07T20:32:41.9074369Z contiguous=True, 2025-05-07T20:32:41.9074604Z compiled=True, 2025-05-07T20:32:41.9074812Z ) 2025-05-07T20:32:42.1075182Z self = 2025-05-07T20:32:42.1075730Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.1076005Z 2025-05-07T20:32:42.1076087Z @given( 2025-05-07T20:32:42.1076325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1076656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1077283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1077785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1078109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1078402Z ) 2025-05-07T20:32:42.1078756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1088035Z def test_silu_mul_quant( 2025-05-07T20:32:42.1088314Z self, 2025-05-07T20:32:42.1088521Z T: int, 2025-05-07T20:32:42.1088725Z D: int, 2025-05-07T20:32:42.1088941Z scale_ub: Optional[float], 2025-05-07T20:32:42.1089227Z contiguous: bool, 2025-05-07T20:32:42.1089473Z compiled: bool, 2025-05-07T20:32:42.1089702Z ) -> None: 2025-05-07T20:32:42.1089929Z torch.manual_seed(2025) 2025-05-07T20:32:42.1090179Z 2025-05-07T20:32:42.1090596Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1092673Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1094549Z 2025-05-07T20:32:42.1094671Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1094893Z 2025-05-07T20:32:42.1095465Z FAILED 2025-05-07T20:32:42.1095577Z 2025-05-07T20:32:42.1095715Z =================================== FAILURES =================================== 2025-05-07T20:32:42.1096144Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:42.1096668Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:42.1098489Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:42.1099122Z | yield 2025-05-07T20:32:42.1099683Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:42.1100261Z | self._callTestMethod(testMethod) 2025-05-07T20:32:42.1100938Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:42.1101650Z | if method() is not None: 2025-05-07T20:32:42.1101912Z | ^^^^^^^^ 2025-05-07T20:32:42.1102675Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:42.1103589Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1104000Z | ^^^^^^^ 2025-05-07T20:32:42.1104804Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:42.1105675Z | raise the_error_hypothesis_found 2025-05-07T20:32:42.1106253Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:42.1106847Z +-+---------------- 1 ---------------- 2025-05-07T20:32:42.1107253Z | Traceback (most recent call last): 2025-05-07T20:32:42.1108298Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.1109391Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1109906Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1112362Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1115213Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1115825Z | self=, 2025-05-07T20:32:42.1116389Z | T=2048, 2025-05-07T20:32:42.1116717Z | D=5120, # or any other generated value 2025-05-07T20:32:42.1117628Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.1118173Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.1118732Z | compiled=False, # or any other generated value 2025-05-07T20:32:42.1119169Z | ) 2025-05-07T20:32:42.1119428Z | 2025-05-07T20:32:42.1120138Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:42.1120985Z +---------------- 2 ---------------- 2025-05-07T20:32:42.1121384Z | Traceback (most recent call last): 2025-05-07T20:32:42.1122379Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.1123460Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1124134Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1126916Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1129713Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1130311Z | self=, 2025-05-07T20:32:42.1130770Z | T=128, 2025-05-07T20:32:42.1130979Z | D=7168, 2025-05-07T20:32:42.1131191Z | scale_ub=None, 2025-05-07T20:32:42.1131485Z | contiguous=True, 2025-05-07T20:32:42.1131831Z | compiled=True, 2025-05-07T20:32:42.1132141Z | ) 2025-05-07T20:32:42.1132391Z | 2025-05-07T20:32:42.1133120Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.1133934Z +---------------- 3 ---------------- 2025-05-07T20:32:42.1134225Z | Traceback (most recent call last): 2025-05-07T20:32:42.1134952Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.1135733Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1136115Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1138164Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
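Each falsifying example above comes with a replay payload, so any one of them can be re-run deterministically. A sketch using the payload Hypothesis printed for failure 1 (the decorator goes on top of the existing @given stack and should be removed after debugging):

    # Sketch: temporarily replay failure 1 from this log. The version string
    # and payload are copied verbatim from the Hypothesis output above.
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...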
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1140412Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1140850Z | self=, 2025-05-07T20:32:42.1141260Z | T=128, 2025-05-07T20:32:42.1141467Z | D=5120, 2025-05-07T20:32:42.1141676Z | scale_ub=1200.0, 2025-05-07T20:32:42.1141920Z | contiguous=True, 2025-05-07T20:32:42.1142165Z | compiled=True, 2025-05-07T20:32:42.1142387Z | ) 2025-05-07T20:32:42.1142570Z | 2025-05-07T20:32:42.1143095Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.1143694Z +---------------- 4 ---------------- 2025-05-07T20:32:42.1144083Z | Traceback (most recent call last): 2025-05-07T20:32:42.1144813Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:42.1145537Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1145822Z | ^^^^^^^^ 2025-05-07T20:32:42.1146468Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:42.1147169Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1147503Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1148308Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:42.1149111Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1149734Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:42.1150547Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1150995Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1151639Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:42.1152425Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1152899Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1153578Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:42.1154393Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1154866Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1155506Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:42.1156222Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1156602Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1157201Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:42.1157772Z | fn() 2025-05-07T20:32:42.1158348Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:42.1158984Z | self.fn.run( 2025-05-07T20:32:42.1159587Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:42.1160176Z | kernel = self.compile( 2025-05-07T20:32:42.1160502Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:42.1161095Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:42.1161808Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1162198Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1162845Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.1163766Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1164245Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1164714Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1165081Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1165341Z | ^ 2025-05-07T20:32:42.1165802Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1166390Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1166788Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:42.1167307Z | self=, 2025-05-07T20:32:42.1167748Z | T=1, # or any other generated value 2025-05-07T20:32:42.1168055Z | D=5120, # or any other generated value 2025-05-07T20:32:42.1168395Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.1168766Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.1169125Z | compiled=True, # or any other generated value 2025-05-07T20:32:42.1169516Z | ) 2025-05-07T20:32:42.1169762Z | 2025-05-07T20:32:42.1170564Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.1171406Z +------------------------------------ 2025-05-07T20:32:42.1171896Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:42.1172414Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1172990Z self=, 2025-05-07T20:32:42.1173563Z T=1, 2025-05-07T20:32:42.1173829Z D=5120, 2025-05-07T20:32:42.1174095Z scale_ub=None, 2025-05-07T20:32:42.1174407Z contiguous=True, 2025-05-07T20:32:42.1174724Z compiled=True, 2025-05-07T20:32:42.1175029Z ) 2025-05-07T20:32:42.1175481Z self = 2025-05-07T20:32:42.1176169Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.1176538Z 2025-05-07T20:32:42.1176680Z @given( 2025-05-07T20:32:42.1177001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1177496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1177936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1178401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1178868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1179276Z ) 2025-05-07T20:32:42.1179766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1180407Z def test_silu_mul_quant( 2025-05-07T20:32:42.1180753Z self, 2025-05-07T20:32:42.1181012Z T: int, 2025-05-07T20:32:42.1181299Z D: int, 2025-05-07T20:32:42.1181618Z scale_ub: Optional[float], 2025-05-07T20:32:42.1181996Z contiguous: 
bool, 2025-05-07T20:32:42.1182343Z compiled: bool, 2025-05-07T20:32:42.1182664Z ) -> None: 2025-05-07T20:32:42.1183079Z torch.manual_seed(2025) 2025-05-07T20:32:42.1183479Z 2025-05-07T20:32:42.1183865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1184357Z 2025-05-07T20:32:42.1184625Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1185040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1185483Z x = x_sign * x_clamp 2025-05-07T20:32:42.1185816Z x0 = x[:, :D] 2025-05-07T20:32:42.1186133Z x1 = x[:, D:] 2025-05-07T20:32:42.1186438Z 2025-05-07T20:32:42.1186703Z if contiguous: 2025-05-07T20:32:42.1187039Z x0 = x0.contiguous() 2025-05-07T20:32:42.1187410Z x1 = x1.contiguous() 2025-05-07T20:32:42.1187744Z 2025-05-07T20:32:42.1188022Z if scale_ub is not None: 2025-05-07T20:32:42.1188415Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1188959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1189392Z ) 2025-05-07T20:32:42.1189686Z else: 2025-05-07T20:32:42.1189994Z scale_ub_tensor = None 2025-05-07T20:32:42.1190345Z 2025-05-07T20:32:42.1190672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1191120Z op = silu_mul_quant 2025-05-07T20:32:42.1191468Z if compiled: 2025-05-07T20:32:42.1191820Z op = torch.compile(op) 2025-05-07T20:32:42.1192243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1192630Z 2025-05-07T20:32:42.1192903Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1193320Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1193731Z 2025-05-07T20:32:42.1194077Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1194540Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1194935Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1195380Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1195890Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1196401Z 2025-05-07T20:32:42.1196677Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1196963Z 2025-05-07T20:32:42.1197108Z moe/activation_test.py:126: 2025-05-07T20:32:42.1197580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1198049Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1198516Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1199644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1200732Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1201504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1202487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1203440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1204600Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1205726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1206804Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1207839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1208742Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1209598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1210343Z fn() 2025-05-07T20:32:42.1211124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1212008Z self.fn.run( 2025-05-07T20:32:42.1212677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1213408Z kernel = self.compile( 2025-05-07T20:32:42.1214140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1215077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1215642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1215967Z 2025-05-07T20:32:42.1216261Z self = 2025-05-07T20:32:42.1217802Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1219864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3ab33d3a0>} 2025-05-07T20:32:42.1221787Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1223242Z context = 2025-05-07T20:32:42.1223652Z 2025-05-07T20:32:42.1223895Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1224625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1225291Z module_map=module_map) 2025-05-07T20:32:42.1225802Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1226354Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1226743Z E ^ 2025-05-07T20:32:42.1227398Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1228034Z 2025-05-07T20:32:42.1228604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1229288Z 2025-05-07T20:32:42.1229422Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1229964Z self=, 2025-05-07T20:32:42.1230497Z T=2048, 2025-05-07T20:32:42.1230736Z D=5120, 2025-05-07T20:32:42.1231001Z scale_ub=1200.0, 2025-05-07T20:32:42.1231302Z contiguous=True, 2025-05-07T20:32:42.1231592Z compiled=False, 2025-05-07T20:32:42.1231869Z ) 2025-05-07T20:32:42.1232304Z self = 2025-05-07T20:32:42.1232992Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.1233370Z 2025-05-07T20:32:42.1233472Z @given( 2025-05-07T20:32:42.1233787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1234216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1234620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1235079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1235531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1235904Z ) 2025-05-07T20:32:42.1236396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1237004Z def test_silu_mul_quant( 2025-05-07T20:32:42.1237327Z self, 2025-05-07T20:32:42.1237581Z T: int, 2025-05-07T20:32:42.1237841Z D: int, 2025-05-07T20:32:42.1238126Z scale_ub: Optional[float], 2025-05-07T20:32:42.1238726Z contiguous: bool, 2025-05-07T20:32:42.1239196Z compiled: bool, 2025-05-07T20:32:42.1239562Z ) -> None: 2025-05-07T20:32:42.1239839Z torch.manual_seed(2025) 2025-05-07T20:32:42.1240173Z 2025-05-07T20:32:42.1240551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1241021Z 2025-05-07T20:32:42.1241291Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1241697Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1242126Z x = x_sign * x_clamp 2025-05-07T20:32:42.1242463Z x0 = x[:, :D] 2025-05-07T20:32:42.1242775Z x1 = x[:, D:] 2025-05-07T20:32:42.1243064Z 2025-05-07T20:32:42.1243328Z if contiguous: 2025-05-07T20:32:42.1243755Z x0 = x0.contiguous() 2025-05-07T20:32:42.1244116Z x1 = x1.contiguous() 2025-05-07T20:32:42.1244455Z 2025-05-07T20:32:42.1244734Z if scale_ub is not None: 2025-05-07T20:32:42.1245205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1245699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1246142Z ) 2025-05-07T20:32:42.1246415Z else: 2025-05-07T20:32:42.1246701Z scale_ub_tensor = None 2025-05-07T20:32:42.1247054Z 2025-05-07T20:32:42.1247381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1247817Z op = silu_mul_quant 2025-05-07T20:32:42.1248173Z if compiled: 2025-05-07T20:32:42.1248523Z op = torch.compile(op) 2025-05-07T20:32:42.1248927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1249313Z 2025-05-07T20:32:42.1249583Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1249805Z 2025-05-07T20:32:42.1249937Z moe/activation_test.py:117: 2025-05-07T20:32:42.1250341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1250799Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1251183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1252229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1253208Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1253945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1254880Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1255795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1256543Z kernel = self.compile( 2025-05-07T20:32:42.1257313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1258240Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1258802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1259131Z 2025-05-07T20:32:42.1259427Z self = 2025-05-07T20:32:42.1260968Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1262912Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3ab1ec2c0>} 2025-05-07T20:32:42.1264690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1266040Z context = 2025-05-07T20:32:42.1266417Z 2025-05-07T20:32:42.1266697Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1267482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1268090Z module_map=module_map) 2025-05-07T20:32:42.1268582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1269099Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1269448Z E ^ 2025-05-07T20:32:42.1270087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1270713Z 2025-05-07T20:32:42.1271290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1272001Z 2025-05-07T20:32:42.1272149Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1272750Z self=, 2025-05-07T20:32:42.1273306Z T=2048, 2025-05-07T20:32:42.1273577Z D=5120, 2025-05-07T20:32:42.1273835Z scale_ub=1200.0, 2025-05-07T20:32:42.1274149Z contiguous=True, 2025-05-07T20:32:42.1274465Z compiled=True, 2025-05-07T20:32:42.1274745Z ) 2025-05-07T20:32:42.1275194Z self = 2025-05-07T20:32:42.1298361Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.1298922Z 2025-05-07T20:32:42.1299037Z @given( 2025-05-07T20:32:42.1299342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1299754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1300156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1300584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1301013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1301392Z ) 2025-05-07T20:32:42.1301869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1302632Z def test_silu_mul_quant( 2025-05-07T20:32:42.1302942Z self, 2025-05-07T20:32:42.1303188Z T: int, 2025-05-07T20:32:42.1303443Z D: int, 2025-05-07T20:32:42.1303734Z scale_ub: Optional[float], 2025-05-07T20:32:42.1304104Z contiguous: bool, 2025-05-07T20:32:42.1304424Z compiled: bool, 2025-05-07T20:32:42.1304713Z ) -> None: 2025-05-07T20:32:42.1305009Z torch.manual_seed(2025) 2025-05-07T20:32:42.1305354Z 2025-05-07T20:32:42.1305736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1306218Z 2025-05-07T20:32:42.1306476Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1306877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1307315Z x = x_sign * x_clamp 2025-05-07T20:32:42.1307574Z x0 = x[:, :D] 2025-05-07T20:32:42.1307796Z x1 = x[:, D:] 2025-05-07T20:32:42.1308005Z 2025-05-07T20:32:42.1308186Z if contiguous: 2025-05-07T20:32:42.1308433Z x0 = x0.contiguous() 2025-05-07T20:32:42.1308690Z x1 = x1.contiguous() 2025-05-07T20:32:42.1308917Z 2025-05-07T20:32:42.1309107Z if scale_ub is not None: 2025-05-07T20:32:42.1309373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1309701Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1310019Z ) 2025-05-07T20:32:42.1310215Z else: 2025-05-07T20:32:42.1310425Z scale_ub_tensor = None 2025-05-07T20:32:42.1310671Z 2025-05-07T20:32:42.1310895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1311203Z op = silu_mul_quant 2025-05-07T20:32:42.1311439Z if compiled: 2025-05-07T20:32:42.1311673Z op = torch.compile(op) 2025-05-07T20:32:42.1311966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1312236Z 2025-05-07T20:32:42.1312493Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1312774Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1313110Z 2025-05-07T20:32:42.1313344Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1313673Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1313957Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1314270Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1314624Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1314923Z 2025-05-07T20:32:42.1315121Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb3aa0eb880>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb3a9e23380>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
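Root cause: fp8e4nv is Triton's name for the float8 e4m3 format, which its NVIDIA backend only lowers on compute capability 8.9 (Ada) and newer; on older parts such as the A10G (sm_86) only the fp8e4b15 and fp8e5 encodings exist, so every kernel that touches an e4m3 tensor fails at compile time regardless of the example parameters. A minimal sketch of a capability guard such a test could use (the helper and marker names here are illustrative, not FBGEMM's actual API):

import pytest
import torch

def fp8_e4m3_supported() -> bool:
    # Triton lowers fp8e4nv (e4m3) only on compute capability 8.9+.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker; applying it to test_silu_mul_quant would turn
# these compile-time failures into skips on pre-sm_89 GPUs.
requires_fp8e4m3 = pytest.mark.skipif(
    not fp8_e4m3_supported(), reason="fp8e4nv (e4m3) requires sm_89 or newer"
)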
The remaining Hypothesis examples fail identically; the test body and tracebacks are the same as above except for the sampled parameters and which wrapper reaches the Triton compiler first. In every example here, compiled=False fails inside fn() while compiling _fbgemm_silu_mul_quant, whereas with compiled=True fn() completes and ref_fn() then fails while compiling _kernel_quantize_fp8_row. Each example ends with:
E       triton.compiler.errors.CompilationError: at 1:0:
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
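Note that ref_fn() is not an independent oracle here: it routes through triton_quantize_fp8_row and therefore needs the same fp8-capable hardware as the op under test. A rough pure-PyTorch stand-in for row-wise e4m3 quantization, written against the dequantization the test uses (y = y_fp8.to(torch.float32) * y_scale[:, None]); this is a sketch under that assumption, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale chosen so each row fills the e4m3 range.
    row_max = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
    inv_scale = FP8_E4M3_MAX / row_max
    x_fp8 = (
        (x.to(torch.float32) * inv_scale)
        .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
        .to(torch.float8_e4m3fn)
    )
    # Returned scale is the dequant multiplier: x ~= x_fp8.float() * scale[:, None]
    return x_fp8, (row_max / FP8_E4M3_MAX).squeeze(-1)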
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
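The error message itself names the formats this architecture does support ('fp8e4b15', 'fp8e5'), which suggests an alternative to skipping: fall back to e5m2 on pre-sm_89 GPUs at the cost of mantissa precision. A hypothetical selector, assuming the kernels could be parameterized by output dtype:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # e4m3 ("fp8e4nv") needs sm_89+; e5m2 ("fp8e5") is available on older parts.
    major, minor = torch.cuda.get_device_capability()
    return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2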
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
= None 2025-05-07T20:32:42.1553779Z 2025-05-07T20:32:42.1553964Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1554098Z op = silu_mul_quant 2025-05-07T20:32:42.1554188Z if compiled: 2025-05-07T20:32:42.1554284Z op = torch.compile(op) 2025-05-07T20:32:42.1554386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1554463Z 2025-05-07T20:32:42.1554551Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1554670Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1554747Z 2025-05-07T20:32:42.1554880Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1554980Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1555083Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1555204Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1555345Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1555465Z 2025-05-07T20:32:42.1555566Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1555576Z 2025-05-07T20:32:42.1555680Z moe/activation_test.py:126: 2025-05-07T20:32:42.1555809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1555913Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1556054Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1556615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1556716Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1557087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1557312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1557692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1557950Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1558399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1558665Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1559045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1559217Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1559559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1559636Z fn() 2025-05-07T20:32:42.1560048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1560131Z self.fn.run( 2025-05-07T20:32:42.1560476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1560585Z kernel = self.compile( 2025-05-07T20:32:42.1560973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1561156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1561282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1561286Z 2025-05-07T20:32:42.1561493Z self = 2025-05-07T20:32:42.1562270Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1562817Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a887e700>} 2025-05-07T20:32:42.1563729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1563923Z context = 2025-05-07T20:32:42.1563928Z 2025-05-07T20:32:42.1564097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1564360Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1564470Z module_map=module_map) 2025-05-07T20:32:42.1564638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1564738Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1564857Z E ^ 2025-05-07T20:32:42.1565225Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1565234Z 2025-05-07T20:32:42.1565653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1565657Z 2025-05-07T20:32:42.1565768Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1565989Z self=, 2025-05-07T20:32:42.1566065Z T=16384, 2025-05-07T20:32:42.1566148Z D=5120, 2025-05-07T20:32:42.1566229Z scale_ub=None, 2025-05-07T20:32:42.1566312Z contiguous=True, 2025-05-07T20:32:42.1566400Z compiled=True, 2025-05-07T20:32:42.1566471Z ) 2025-05-07T20:32:42.1566689Z self = 2025-05-07T20:32:42.1566870Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.1566877Z 2025-05-07T20:32:42.1566953Z @given( 2025-05-07T20:32:42.1567085Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1567226Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1567340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1567463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1567576Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1567650Z ) 2025-05-07T20:32:42.1567904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1568000Z def test_silu_mul_quant( 2025-05-07T20:32:42.1568085Z self, 2025-05-07T20:32:42.1568163Z T: int, 2025-05-07T20:32:42.1568242Z D: int, 2025-05-07T20:32:42.1568341Z scale_ub: Optional[float], 2025-05-07T20:32:42.1568432Z contiguous: bool, 2025-05-07T20:32:42.1568516Z compiled: bool, 2025-05-07T20:32:42.1568600Z ) -> None: 2025-05-07T20:32:42.1568706Z torch.manual_seed(2025) 2025-05-07T20:32:42.1568779Z 2025-05-07T20:32:42.1568960Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1569038Z 2025-05-07T20:32:42.1569126Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1569260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1569347Z x = x_sign * x_clamp 2025-05-07T20:32:42.1569425Z x0 = x[:, :D] 2025-05-07T20:32:42.1569508Z x1 = x[:, D:] 2025-05-07T20:32:42.1569583Z 2025-05-07T20:32:42.1569671Z if contiguous: 2025-05-07T20:32:42.1569758Z x0 = x0.contiguous() 2025-05-07T20:32:42.1569847Z x1 = x1.contiguous() 2025-05-07T20:32:42.1569930Z 2025-05-07T20:32:42.1570019Z if scale_ub is not None: 2025-05-07T20:32:42.1570123Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1570259Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:42.1570341Z ) 2025-05-07T20:32:42.1570417Z else: 2025-05-07T20:32:42.1570562Z scale_ub_tensor = None 2025-05-07T20:32:42.1570710Z 2025-05-07T20:32:42.1570838Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1570931Z op = silu_mul_quant 2025-05-07T20:32:42.1571014Z if compiled: 2025-05-07T20:32:42.1571117Z op = torch.compile(op) 2025-05-07T20:32:42.1571220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1571291Z 2025-05-07T20:32:42.1571387Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1571506Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1571574Z 2025-05-07T20:32:42.1571713Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1571815Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1571912Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1572040Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1572222Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1572303Z 2025-05-07T20:32:42.1572411Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1572416Z 2025-05-07T20:32:42.1572513Z moe/activation_test.py:126: 2025-05-07T20:32:42.1572647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1572754Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1572890Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1573461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1573560Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1573922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1574151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1574523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1574829Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1575228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1575482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1575863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1576028Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1576379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1576456Z fn() 2025-05-07T20:32:42.1576865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1576960Z self.fn.run( 2025-05-07T20:32:42.1577305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1577400Z kernel = self.compile( 2025-05-07T20:32:42.1577796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1577972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1578108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.1578112Z 2025-05-07T20:32:42.1578317Z self = 2025-05-07T20:32:42.1579092Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1579643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8415d00>} 2025-05-07T20:32:42.1580436Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1580640Z context = 2025-05-07T20:32:42.1580645Z 2025-05-07T20:32:42.1580810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1592988Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1593125Z module_map=module_map) 2025-05-07T20:32:42.1593377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1593496Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1593574Z E ^ 2025-05-07T20:32:42.1593950Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1593956Z 2025-05-07T20:32:42.1594380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1594385Z 2025-05-07T20:32:42.1594488Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1594722Z self=, 2025-05-07T20:32:42.1594801Z T=1, 2025-05-07T20:32:42.1594886Z D=5120, 2025-05-07T20:32:42.1594970Z scale_ub=1200.0, 2025-05-07T20:32:42.1595055Z contiguous=True, 2025-05-07T20:32:42.1595147Z compiled=True, 2025-05-07T20:32:42.1595220Z ) 2025-05-07T20:32:42.1595440Z self = 2025-05-07T20:32:42.1595624Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.1595677Z 2025-05-07T20:32:42.1595760Z @given( 2025-05-07T20:32:42.1595882Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1595991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1596107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1596233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1596349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1596424Z ) 2025-05-07T20:32:42.1596680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1596777Z def test_silu_mul_quant( 2025-05-07T20:32:42.1596854Z self, 2025-05-07T20:32:42.1596944Z T: int, 2025-05-07T20:32:42.1597022Z D: int, 2025-05-07T20:32:42.1597123Z scale_ub: Optional[float], 2025-05-07T20:32:42.1597223Z contiguous: bool, 2025-05-07T20:32:42.1597323Z compiled: bool, 2025-05-07T20:32:42.1597421Z ) -> None: 2025-05-07T20:32:42.1597549Z torch.manual_seed(2025) 2025-05-07T20:32:42.1597630Z 2025-05-07T20:32:42.1597813Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1597891Z 2025-05-07T20:32:42.1597985Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1598119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1598210Z x = x_sign * x_clamp 2025-05-07T20:32:42.1598293Z x0 = x[:, :D] 2025-05-07T20:32:42.1598380Z x1 = x[:, D:] 2025-05-07T20:32:42.1598453Z 2025-05-07T20:32:42.1598543Z if contiguous: 2025-05-07T20:32:42.1598645Z x0 = x0.contiguous() 2025-05-07T20:32:42.1598735Z x1 = x1.contiguous() 2025-05-07T20:32:42.1598810Z 2025-05-07T20:32:42.1598916Z if scale_ub is not None: 2025-05-07T20:32:42.1599024Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:42.1599162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1599299Z ) 2025-05-07T20:32:42.1599421Z else: 2025-05-07T20:32:42.1599527Z scale_ub_tensor = None 2025-05-07T20:32:42.1599599Z 2025-05-07T20:32:42.1599730Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1599828Z op = silu_mul_quant 2025-05-07T20:32:42.1599911Z if compiled: 2025-05-07T20:32:42.1600012Z op = torch.compile(op) 2025-05-07T20:32:42.1600127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1600201Z 2025-05-07T20:32:42.1600294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1600298Z 2025-05-07T20:32:42.1600408Z moe/activation_test.py:117: 2025-05-07T20:32:42.1600539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1600649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1600748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1601170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1601283Z return fn(*args, **kwargs) 2025-05-07T20:32:42.1601786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1601885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1602256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1602482Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1602835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1602930Z kernel = self.compile( 2025-05-07T20:32:42.1603316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1603510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1603808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1603815Z 2025-05-07T20:32:42.1604032Z self = 2025-05-07T20:32:42.1604804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1605307Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8670ae0>} 2025-05-07T20:32:42.1606067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1606258Z context = 2025-05-07T20:32:42.1606266Z 2025-05-07T20:32:42.1606440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1606704Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1606812Z module_map=module_map) 2025-05-07T20:32:42.1606981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1607079Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1607154Z E ^ 2025-05-07T20:32:42.1607518Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1607523Z 2025-05-07T20:32:42.1607939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1607944Z 2025-05-07T20:32:42.1608054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1608329Z self=, 2025-05-07T20:32:42.1608443Z T=1, 2025-05-07T20:32:42.1608527Z D=5120, 2025-05-07T20:32:42.1608609Z scale_ub=None, 2025-05-07T20:32:42.1608703Z contiguous=False, 2025-05-07T20:32:42.1608786Z compiled=True, 2025-05-07T20:32:42.1608860Z ) 2025-05-07T20:32:42.1609084Z self = 2025-05-07T20:32:42.1609249Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1609254Z 2025-05-07T20:32:42.1609331Z @given( 2025-05-07T20:32:42.1609461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1609563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1609676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1609803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1609956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1610048Z ) 2025-05-07T20:32:42.1610295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1610390Z def test_silu_mul_quant( 2025-05-07T20:32:42.1610473Z self, 2025-05-07T20:32:42.1610549Z T: int, 2025-05-07T20:32:42.1610625Z D: int, 2025-05-07T20:32:42.1610733Z scale_ub: Optional[float], 2025-05-07T20:32:42.1610822Z contiguous: bool, 2025-05-07T20:32:42.1610906Z compiled: bool, 2025-05-07T20:32:42.1610996Z ) -> None: 2025-05-07T20:32:42.1611091Z torch.manual_seed(2025) 2025-05-07T20:32:42.1611165Z 2025-05-07T20:32:42.1611346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1611421Z 2025-05-07T20:32:42.1611522Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1611646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1611738Z x = x_sign * x_clamp 2025-05-07T20:32:42.1611828Z x0 = x[:, :D] 2025-05-07T20:32:42.1611953Z x1 = x[:, D:] 2025-05-07T20:32:42.1612030Z 2025-05-07T20:32:42.1612124Z if contiguous: 2025-05-07T20:32:42.1612215Z x0 = x0.contiguous() 2025-05-07T20:32:42.1612304Z x1 = x1.contiguous() 2025-05-07T20:32:42.1612386Z 2025-05-07T20:32:42.1612477Z if scale_ub is not None: 2025-05-07T20:32:42.1612585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1612727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1612805Z ) 2025-05-07T20:32:42.1612881Z else: 2025-05-07T20:32:42.1612987Z scale_ub_tensor = None 2025-05-07T20:32:42.1613061Z 2025-05-07T20:32:42.1613199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1613289Z op = silu_mul_quant 2025-05-07T20:32:42.1613375Z if compiled: 2025-05-07T20:32:42.1613488Z op = torch.compile(op) 2025-05-07T20:32:42.1613596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1613675Z 2025-05-07T20:32:42.1613775Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1613900Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1613974Z 2025-05-07T20:32:42.1614122Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1614225Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1614326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1614457Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1614597Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1614678Z 2025-05-07T20:32:42.1614778Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.1614782Z 2025-05-07T20:32:42.1614883Z moe/activation_test.py:126: 2025-05-07T20:32:42.1615011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1615196Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1615333Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1615941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1616044Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1616407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1616639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1617007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1617262Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1617737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1618022Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1618410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1618577Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1618920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1619001Z fn() 2025-05-07T20:32:42.1619400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1619490Z self.fn.run( 2025-05-07T20:32:42.1619830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1619922Z kernel = self.compile( 2025-05-07T20:32:42.1620318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1620538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1620667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1620677Z 2025-05-07T20:32:42.1620882Z self = 2025-05-07T20:32:42.1621653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1622161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fb3a8699e40>} 2025-05-07T20:32:42.1622911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1623109Z context = 2025-05-07T20:32:42.1623114Z 2025-05-07T20:32:42.1623274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1623537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1623647Z module_map=module_map) 2025-05-07T20:32:42.1623810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1623909Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1623993Z E ^ 2025-05-07T20:32:42.1624350Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1624354Z 2025-05-07T20:32:42.1624822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1624830Z 2025-05-07T20:32:42.1624972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1625191Z self=, 2025-05-07T20:32:42.1625274Z T=1, 2025-05-07T20:32:42.1625347Z D=5120, 2025-05-07T20:32:42.1625433Z scale_ub=None, 2025-05-07T20:32:42.1625516Z contiguous=True, 2025-05-07T20:32:42.1625597Z compiled=False, 2025-05-07T20:32:42.1625675Z ) 2025-05-07T20:32:42.1625889Z self = 2025-05-07T20:32:42.1626050Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1626054Z 2025-05-07T20:32:42.1626136Z @given( 2025-05-07T20:32:42.1626253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1626348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1626513Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1626631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1626752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1626825Z ) 2025-05-07T20:32:42.1627068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1627163Z def test_silu_mul_quant( 2025-05-07T20:32:42.1627239Z self, 2025-05-07T20:32:42.1627312Z T: int, 2025-05-07T20:32:42.1627391Z D: int, 2025-05-07T20:32:42.1627486Z scale_ub: Optional[float], 2025-05-07T20:32:42.1627571Z contiguous: bool, 2025-05-07T20:32:42.1627659Z compiled: bool, 2025-05-07T20:32:42.1627737Z ) -> None: 2025-05-07T20:32:42.1627829Z torch.manual_seed(2025) 2025-05-07T20:32:42.1627908Z 2025-05-07T20:32:42.1628078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1628153Z 2025-05-07T20:32:42.1628246Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1628372Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1628513Z x = x_sign * x_clamp 2025-05-07T20:32:42.1628596Z x0 = x[:, :D] 2025-05-07T20:32:42.1628678Z x1 = x[:, D:] 2025-05-07T20:32:42.1628756Z 2025-05-07T20:32:42.1628838Z if contiguous: 2025-05-07T20:32:42.1628926Z x0 = x0.contiguous() 2025-05-07T20:32:42.1629019Z x1 = x1.contiguous() 2025-05-07T20:32:42.1629091Z 2025-05-07T20:32:42.1629179Z if scale_ub is not None: 2025-05-07T20:32:42.1629292Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1629426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1629501Z ) 2025-05-07T20:32:42.1629585Z else: 2025-05-07T20:32:42.1629676Z scale_ub_tensor = None 2025-05-07T20:32:42.1629751Z 2025-05-07T20:32:42.1629882Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1629974Z op = silu_mul_quant 2025-05-07T20:32:42.1630071Z if compiled: 2025-05-07T20:32:42.1630173Z 
op = torch.compile(op) 2025-05-07T20:32:42.1630280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1630358Z 2025-05-07T20:32:42.1630447Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1630451Z 2025-05-07T20:32:42.1630550Z moe/activation_test.py:117: 2025-05-07T20:32:42.1630683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1630783Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1630891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1631394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1631492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1631860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1632129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1632575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1632673Z kernel = self.compile( 2025-05-07T20:32:42.1633058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1633239Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1633365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1633369Z 2025-05-07T20:32:42.1633574Z self = 2025-05-07T20:32:42.1634389Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1634894Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a869b740>} 2025-05-07T20:32:42.1635659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1635847Z context = 2025-05-07T20:32:42.1635851Z 2025-05-07T20:32:42.1636022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1636284Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1636388Z module_map=module_map) 2025-05-07T20:32:42.1636555Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1636654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1636732Z E ^ 2025-05-07T20:32:42.1637138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1637147Z 2025-05-07T20:32:42.1637562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1637566Z 2025-05-07T20:32:42.1637673Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1637892Z self=, 2025-05-07T20:32:42.1637966Z T=128, 2025-05-07T20:32:42.1638046Z D=5120, 2025-05-07T20:32:42.1638126Z scale_ub=None, 2025-05-07T20:32:42.1638209Z contiguous=False, 2025-05-07T20:32:42.1638297Z compiled=True, 2025-05-07T20:32:42.1638372Z ) 2025-05-07T20:32:42.1639110Z self = 2025-05-07T20:32:42.1639295Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1639305Z 2025-05-07T20:32:42.1639382Z @given( 2025-05-07T20:32:42.1639510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1639606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1639719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1639841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1639953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1640027Z ) 2025-05-07T20:32:42.1640280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1640373Z def test_silu_mul_quant( 2025-05-07T20:32:42.1640450Z self, 2025-05-07T20:32:42.1640532Z T: int, 2025-05-07T20:32:42.1640605Z D: int, 2025-05-07T20:32:42.1640711Z scale_ub: Optional[float], 2025-05-07T20:32:42.1640797Z contiguous: bool, 2025-05-07T20:32:42.1640883Z compiled: bool, 2025-05-07T20:32:42.1640965Z ) -> None: 2025-05-07T20:32:42.1641232Z torch.manual_seed(2025) 2025-05-07T20:32:42.1641369Z 2025-05-07T20:32:42.1641546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1641616Z 2025-05-07T20:32:42.1641706Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1641837Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1641923Z x = x_sign * x_clamp 2025-05-07T20:32:42.1642001Z x0 = x[:, :D] 2025-05-07T20:32:42.1642084Z x1 = x[:, D:] 2025-05-07T20:32:42.1642157Z 2025-05-07T20:32:42.1642244Z if contiguous: 2025-05-07T20:32:42.1642334Z x0 = x0.contiguous() 2025-05-07T20:32:42.1642420Z x1 = x1.contiguous() 2025-05-07T20:32:42.1642502Z 2025-05-07T20:32:42.1642592Z if scale_ub is not None: 2025-05-07T20:32:42.1642695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1642907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1642989Z ) 2025-05-07T20:32:42.1643065Z else: 2025-05-07T20:32:42.1643169Z scale_ub_tensor = None 2025-05-07T20:32:42.1643241Z 2025-05-07T20:32:42.1643371Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1643469Z op = silu_mul_quant 2025-05-07T20:32:42.1643658Z if compiled: 2025-05-07T20:32:42.1643764Z op = torch.compile(op) 2025-05-07T20:32:42.1643868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1643938Z 2025-05-07T20:32:42.1644033Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1644037Z 2025-05-07T20:32:42.1644132Z moe/activation_test.py:117: 2025-05-07T20:32:42.1644263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1644369Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1644468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1644845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1645044Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1645540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1645642Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1646002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1646223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1646573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1646664Z kernel = self.compile( 2025-05-07T20:32:42.1647056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1647232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1647362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1647368Z 2025-05-07T20:32:42.1647579Z self = 2025-05-07T20:32:42.1648346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1648850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287d09120>} 2025-05-07T20:32:42.1649603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1649837Z context = 2025-05-07T20:32:42.1649881Z 2025-05-07T20:32:42.1650055Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1650320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1650431Z module_map=module_map) 2025-05-07T20:32:42.1650590Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1650686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1650768Z E ^ 2025-05-07T20:32:42.1651123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1651127Z 2025-05-07T20:32:42.1651542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1651553Z 2025-05-07T20:32:42.1651696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1651922Z self=, 2025-05-07T20:32:42.1652012Z T=128, 2025-05-07T20:32:42.1652088Z D=7168, 2025-05-07T20:32:42.1652170Z scale_ub=1200.0, 2025-05-07T20:32:42.1652262Z contiguous=False, 2025-05-07T20:32:42.1652344Z compiled=False, 2025-05-07T20:32:42.1652416Z ) 2025-05-07T20:32:42.1652638Z self = 2025-05-07T20:32:42.1652809Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1652814Z 2025-05-07T20:32:42.1652893Z @given( 2025-05-07T20:32:42.1653008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1653104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1653223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1653336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1653448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1653534Z ) 2025-05-07T20:32:42.1653824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1653918Z def test_silu_mul_quant( 2025-05-07T20:32:42.1653999Z self, 2025-05-07T20:32:42.1654072Z T: int, 2025-05-07T20:32:42.1654149Z D: int, 2025-05-07T20:32:42.1654263Z scale_ub: Optional[float], 2025-05-07T20:32:42.1654350Z contiguous: bool, 2025-05-07T20:32:42.1654435Z compiled: bool, 2025-05-07T20:32:42.1654518Z ) -> None: 2025-05-07T20:32:42.1654613Z torch.manual_seed(2025) 2025-05-07T20:32:42.1654684Z 2025-05-07T20:32:42.1654867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1654942Z 2025-05-07T20:32:42.1655042Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1655165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1655254Z x = x_sign * x_clamp 2025-05-07T20:32:42.1655340Z x0 = x[:, :D] 2025-05-07T20:32:42.1655426Z x1 = x[:, D:] 2025-05-07T20:32:42.1655506Z 2025-05-07T20:32:42.1655593Z if contiguous: 2025-05-07T20:32:42.1655682Z x0 = x0.contiguous() 2025-05-07T20:32:42.1655767Z x1 = x1.contiguous() 2025-05-07T20:32:42.1655847Z 2025-05-07T20:32:42.1655936Z if scale_ub is not None: 2025-05-07T20:32:42.1656036Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1656177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1656250Z ) 2025-05-07T20:32:42.1656326Z else: 2025-05-07T20:32:42.1656426Z scale_ub_tensor = None 2025-05-07T20:32:42.1656499Z 2025-05-07T20:32:42.1656635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1656724Z op = silu_mul_quant 2025-05-07T20:32:42.1656807Z if compiled: 2025-05-07T20:32:42.1656915Z op = torch.compile(op) 2025-05-07T20:32:42.1657074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1657152Z 2025-05-07T20:32:42.1657294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1657298Z 2025-05-07T20:32:42.1657394Z moe/activation_test.py:117: 2025-05-07T20:32:42.1657523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1657628Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1657726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1658232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1658329Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1658691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1658922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1659307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1659406Z kernel = self.compile( 2025-05-07T20:32:42.1659800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1659974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1660110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1660115Z 2025-05-07T20:32:42.1660321Z self = 2025-05-07T20:32:42.1661090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1661602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287d08360>} 2025-05-07T20:32:42.1662399Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1662597Z context = 2025-05-07T20:32:42.1662601Z 2025-05-07T20:32:42.1662766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1663039Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1663148Z module_map=module_map) 2025-05-07T20:32:42.1663309Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1663414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1663491Z E ^ 2025-05-07T20:32:42.1663850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1663860Z 2025-05-07T20:32:42.1664282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1664287Z 2025-05-07T20:32:42.1664388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1664617Z self=, 2025-05-07T20:32:42.1664695Z T=128, 2025-05-07T20:32:42.1664771Z D=5120, 2025-05-07T20:32:42.1664860Z scale_ub=None, 2025-05-07T20:32:42.1664946Z contiguous=False, 2025-05-07T20:32:42.1665030Z compiled=False, 2025-05-07T20:32:42.1665108Z ) 2025-05-07T20:32:42.1665327Z self = 2025-05-07T20:32:42.1665495Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.1665510Z 2025-05-07T20:32:42.1665588Z @given( 2025-05-07T20:32:42.1665752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1665862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1666018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1666132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1666252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1666324Z ) 2025-05-07T20:32:42.1666568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1666665Z def test_silu_mul_quant( 2025-05-07T20:32:42.1666740Z self, 2025-05-07T20:32:42.1666815Z T: int, 2025-05-07T20:32:42.1666902Z D: int, 2025-05-07T20:32:42.1666996Z scale_ub: Optional[float], 2025-05-07T20:32:42.1667093Z contiguous: bool, 2025-05-07T20:32:42.1667179Z compiled: bool, 2025-05-07T20:32:42.1667255Z ) -> None: 2025-05-07T20:32:42.1667355Z torch.manual_seed(2025) 2025-05-07T20:32:42.1667472Z 2025-05-07T20:32:42.1667645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1667736Z 2025-05-07T20:32:42.1667827Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1667952Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1668047Z x = x_sign * x_clamp 2025-05-07T20:32:42.1668125Z x0 = x[:, :D] 2025-05-07T20:32:42.1668203Z x1 = x[:, D:] 2025-05-07T20:32:42.1668285Z 2025-05-07T20:32:42.1668369Z if contiguous: 2025-05-07T20:32:42.1668469Z x0 = x0.contiguous() 2025-05-07T20:32:42.1668564Z x1 = x1.contiguous() 2025-05-07T20:32:42.1668638Z 2025-05-07T20:32:42.1668735Z if scale_ub is not None: 2025-05-07T20:32:42.1668841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1668975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1669057Z ) 2025-05-07T20:32:42.1669133Z else: 2025-05-07T20:32:42.1669230Z scale_ub_tensor = None 2025-05-07T20:32:42.1669314Z 2025-05-07T20:32:42.1669488Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1669577Z op = silu_mul_quant 2025-05-07T20:32:42.1669666Z if compiled: 2025-05-07T20:32:42.1669764Z op = torch.compile(op) 2025-05-07T20:32:42.1669876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1669947Z 2025-05-07T20:32:42.1670036Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1670041Z 2025-05-07T20:32:42.1670144Z moe/activation_test.py:117: 2025-05-07T20:32:42.1670271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1670371Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1670475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1670978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1671071Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1671444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1671669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1672020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1672113Z kernel = self.compile( 2025-05-07T20:32:42.1672502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1672683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1672808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1672813Z 2025-05-07T20:32:42.1673024Z self = 2025-05-07T20:32:42.1673840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1674376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785c720>} 2025-05-07T20:32:42.1675137Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1675325Z context = 2025-05-07T20:32:42.1675330Z 2025-05-07T20:32:42.1675501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1675822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1675933Z module_map=module_map) 2025-05-07T20:32:42.1676104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1676200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1676281Z E ^ 2025-05-07T20:32:42.1676635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1676639Z 2025-05-07T20:32:42.1677053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1677057Z 2025-05-07T20:32:42.1677167Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1677386Z self=, 2025-05-07T20:32:42.1677466Z T=128, 2025-05-07T20:32:42.1677542Z D=5120, 2025-05-07T20:32:42.1677621Z scale_ub=1200.0, 2025-05-07T20:32:42.1677710Z contiguous=True, 2025-05-07T20:32:42.1677795Z compiled=False, 2025-05-07T20:32:42.1677866Z ) 2025-05-07T20:32:42.1678136Z self = 2025-05-07T20:32:42.1678306Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.1678311Z 2025-05-07T20:32:42.1678384Z @given( 2025-05-07T20:32:42.1678507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1678604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1678724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1678839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1678949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1679029Z ) 2025-05-07T20:32:42.1679273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1679365Z def test_silu_mul_quant( 2025-05-07T20:32:42.1679446Z self, 2025-05-07T20:32:42.1679524Z T: int, 2025-05-07T20:32:42.1679598Z D: int, 2025-05-07T20:32:42.1679705Z scale_ub: Optional[float], 2025-05-07T20:32:42.1679796Z contiguous: bool, 2025-05-07T20:32:42.1679881Z compiled: bool, 2025-05-07T20:32:42.1679962Z ) -> None: 2025-05-07T20:32:42.1680053Z torch.manual_seed(2025) 2025-05-07T20:32:42.1680121Z 2025-05-07T20:32:42.1680295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1680366Z 2025-05-07T20:32:42.1680460Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1680580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1680667Z x = x_sign * x_clamp 2025-05-07T20:32:42.1680751Z x0 = x[:, :D] 2025-05-07T20:32:42.1680830Z x1 = x[:, D:] 2025-05-07T20:32:42.1680899Z 2025-05-07T20:32:42.1680985Z if contiguous: 2025-05-07T20:32:42.1681073Z x0 = x0.contiguous() 2025-05-07T20:32:42.1681158Z x1 = x1.contiguous() 2025-05-07T20:32:42.1681241Z 2025-05-07T20:32:42.1681379Z if scale_ub is not None: 2025-05-07T20:32:42.1681490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1681667Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1681742Z ) 2025-05-07T20:32:42.1681823Z else: 2025-05-07T20:32:42.1681914Z scale_ub_tensor = None 2025-05-07T20:32:42.1681986Z 2025-05-07T20:32:42.1682121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1682209Z op = silu_mul_quant 2025-05-07T20:32:42.1682298Z if compiled: 2025-05-07T20:32:42.1682401Z op = torch.compile(op) 2025-05-07T20:32:42.1682503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1682575Z 2025-05-07T20:32:42.1682670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1682675Z 2025-05-07T20:32:42.1682772Z moe/activation_test.py:117: 2025-05-07T20:32:42.1682944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1683048Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1683151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1683749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1683847Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1684208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1684433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1684775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1684878Z kernel = self.compile( 2025-05-07T20:32:42.1685262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1685441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1685624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1685632Z 2025-05-07T20:32:42.1685836Z self = 2025-05-07T20:32:42.1686613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1687110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785d8a0>} 2025-05-07T20:32:42.1687865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1688068Z context = 2025-05-07T20:32:42.1688079Z 2025-05-07T20:32:42.1688245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1688514Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1688619Z module_map=module_map) 2025-05-07T20:32:42.1688779Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1688885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1688960Z E ^ 2025-05-07T20:32:42.1689315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.1689327Z 
2025-05-07T20:32:42.1689741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.1689746Z 
2025-05-07T20:32:42.1689850Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.1690121Z     self=,
2025-05-07T20:32:42.1690236Z     T=1,
2025-05-07T20:32:42.1690313Z     D=7168,
2025-05-07T20:32:42.1690405Z     scale_ub=1200.0,
2025-05-07T20:32:42.1690487Z     contiguous=True,
2025-05-07T20:32:42.1690570Z     compiled=True,
2025-05-07T20:32:42.1690653Z )
2025-05-07T20:32:42.1690868Z self = 
2025-05-07T20:32:42.1691043Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:42.1691047Z 
2025-05-07T20:32:42.1691124Z     @given(
2025-05-07T20:32:42.1691241Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:42.1691342Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:42.1691453Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:42.1691566Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:42.1691724Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:42.1691799Z     )
2025-05-07T20:32:42.1692048Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:42.1692151Z     def test_silu_mul_quant(
2025-05-07T20:32:42.1692228Z         self,
2025-05-07T20:32:42.1692308Z         T: int,
2025-05-07T20:32:42.1692383Z         D: int,
2025-05-07T20:32:42.1692477Z         scale_ub: Optional[float],
2025-05-07T20:32:42.1692571Z         contiguous: bool,
2025-05-07T20:32:42.1692654Z         compiled: bool,
2025-05-07T20:32:42.1692729Z     ) -> None:
2025-05-07T20:32:42.1692830Z         torch.manual_seed(2025)
2025-05-07T20:32:42.1692901Z 
2025-05-07T20:32:42.1693073Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.1693154Z 
2025-05-07T20:32:42.1693247Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.1693369Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.1693463Z         x = x_sign * x_clamp
2025-05-07T20:32:42.1693538Z         x0 = x[:, :D]
2025-05-07T20:32:42.1693667Z         x1 = x[:, D:]
2025-05-07T20:32:42.1693740Z 
2025-05-07T20:32:42.1693818Z         if contiguous:
2025-05-07T20:32:42.1693916Z             x0 = x0.contiguous()
2025-05-07T20:32:42.1694004Z             x1 = x1.contiguous()
2025-05-07T20:32:42.1694078Z 
2025-05-07T20:32:42.1694173Z         if scale_ub is not None:
2025-05-07T20:32:42.1694276Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.1694408Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.1694486Z             )
2025-05-07T20:32:42.1694560Z         else:
2025-05-07T20:32:42.1694653Z             scale_ub_tensor = None
2025-05-07T20:32:42.1694730Z 
2025-05-07T20:32:42.1694864Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.1694951Z             op = silu_mul_quant
2025-05-07T20:32:42.1695044Z             if compiled:
2025-05-07T20:32:42.1695147Z                 op = torch.compile(op)
2025-05-07T20:32:42.1695263Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.1695342Z 
2025-05-07T20:32:42.1695435Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.1695439Z 
2025-05-07T20:32:42.1695543Z moe/activation_test.py:117: 
2025-05-07T20:32:42.1695673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1695775Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.1695879Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.1696252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:42.1696356Z     return fn(*args, **kwargs)
2025-05-07T20:32:42.1696860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.1696954Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.1697324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.1697594Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.1697973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.1698079Z     kernel = self.compile(
2025-05-07T20:32:42.1698463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.1698646Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.1698772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1698777Z 
2025-05-07T20:32:42.1698981Z self = 
2025-05-07T20:32:42.1699799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.1700305Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785ee80>}
2025-05-07T20:32:42.1701064Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.1701254Z context = 
2025-05-07T20:32:42.1701258Z 
2025-05-07T20:32:42.1701429Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.1701691Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.1701799Z                            module_map=module_map)
2025-05-07T20:32:42.1701971Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.1702070Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.1702192Z E       ^
2025-05-07T20:32:42.1702555Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.1702559Z 
2025-05-07T20:32:42.1702974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
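Every failure in this excerpt has the same root cause: Triton refuses to lower the fp8e4nv element type (PyTorch's torch.float8_e4m3fn) on this runner's GPU. fp8e4nv lowering requires an NVIDIA device with compute capability 8.9 or newer (Ada or Hopper); the linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which is SM 8.6, so only 'fp8e4b15' and 'fp8e5' are available and every kernel that touches fp8e4nv dies in ast_to_ttir. A minimal sketch of a capability gate that would skip these cases on unsupported hardware (the helper and class names are illustrative, not from activation_test.py):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ GPUs
        # (e.g. L4, L40S, H100). The A10G on a g5.4xlarge is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTest(unittest.TestCase):
        ...

Gating at class level would skip the whole property before Hypothesis draws any examples, instead of failing one example at a time as happens in this log.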
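The repetition that follows is expected Hypothesis behavior rather than noise: @given with st.sampled_from draws from a fixed 5 x 2 x 2 x 2 x 2 grid (80 parameter combinations), @settings(max_examples=_MAX_SAMPLES) bounds how many of them are tried, and verbosity=Verbosity.verbose echoes each attempt as a "Trying example" block with the full test listing. A self-contained sketch of the same pattern (the test body here is illustrative):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # At verbose verbosity each drawn (T, D) pair is printed as
        # "Trying example: ...", exactly like the blocks in this log.
        assert T * D > 0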
[... repeated identical failure omitted: Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True); same listing, same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:42.1703087Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.1703310Z     self=,
2025-05-07T20:32:42.1703387Z     T=1,
2025-05-07T20:32:42.1703471Z     D=7168,
2025-05-07T20:32:42.1703553Z     scale_ub=None,
2025-05-07T20:32:42.1703639Z     contiguous=False,
2025-05-07T20:32:42.1703729Z     compiled=True,
2025-05-07T20:32:42.1703802Z )
2025-05-07T20:32:42.1704026Z self = 
2025-05-07T20:32:42.1704203Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... test body identical to the listing above, through the definition of fn() ...]
2025-05-07T20:32:42.1727304Z         y_fp8, y_scale = fn()
2025-05-07T20:32:42.1727434Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:42.1727506Z 
2025-05-07T20:32:42.1727642Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.1727751Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:42.1727851Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:42.1727971Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:42.1728116Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.1728191Z 
2025-05-07T20:32:42.1728297Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:42.1728301Z 
2025-05-07T20:32:42.1728398Z moe/activation_test.py:126: 
2025-05-07T20:32:42.1728569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1728684Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:42.1728819Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.1729379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:42.1729485Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:42.1729845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.1730082Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.1730450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:42.1730708Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.1731118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:42.1731415Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.1731800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:42.1731964Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:42.1732307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:42.1732389Z     fn()
2025-05-07T20:32:42.1732791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:42.1732874Z     self.fn.run(
2025-05-07T20:32:42.1733223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.1733317Z     kernel = self.compile(
2025-05-07T20:32:42.1733709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.1733891Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.1734019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1734024Z 
2025-05-07T20:32:42.1734236Z self = 
2025-05-07T20:32:42.1735008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.1735514Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287af5580>}
2025-05-07T20:32:42.1736309Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.1736546Z context = 
2025-05-07T20:32:42.1736560Z 
2025-05-07T20:32:42.1736726Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.1736993Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.1737108Z                            module_map=module_map)
2025-05-07T20:32:42.1737267Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.1737367Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:42.1737452Z E       ^
2025-05-07T20:32:42.1737806Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.1737811Z 
2025-05-07T20:32:42.1738277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
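The example just above is the one informative variant in the batch: fn() itself returned, and the failure moved into the reference path, where triton_quantize_fp8_row launches its own Triton kernel (_kernel_quantize_fp8_row) from the autotuner and trips over the same fp8e4nv restriction. The row-wise quantization scheme itself needs no Triton; a plain-PyTorch sketch of the idea follows (an illustration of per-row FP8 scaling, not FBGEMM's implementation; the exact scale_ub semantics are assumed):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Map each row's max |value| onto the fp8 e4m3 max (448.0),
        # optionally capping the row max at scale_ub first.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / fp8_max  # dequantization scale
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Because this sketch never touches Triton, it matches the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]) and is one way to keep a numerical reference alive on hardware where the fused kernels cannot compile.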
[... eight further examples omitted; each reproduces the same listing and fails with the identical CompilationError ("type fp8e4nv not supported in this architecture") while compiling _fbgemm_silu_mul_quant:
    T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False
    T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
    T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
    T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
    T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True ...]
2025-05-07T20:32:42.1849497Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.1849722Z     self=,
2025-05-07T20:32:42.1849881Z     T=4096,
2025-05-07T20:32:42.1849964Z     D=5120,
2025-05-07T20:32:42.1850048Z     scale_ub=1200.0,
2025-05-07T20:32:42.1850147Z     contiguous=False,
2025-05-07T20:32:42.1850235Z     compiled=False,
2025-05-07T20:32:42.1850310Z )
2025-05-07T20:32:42.1850537Z self = 
2025-05-07T20:32:42.1850711Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
[... test body identical to the listing above ...]
2025-05-07T20:32:42.1855207Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.1855371Z moe/activation_test.py:117: 
2025-05-07T20:32:42.1855554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1855655Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.1855752Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.1856260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.1856356Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1856729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1856952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1857294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1857438Z kernel = self.compile( 2025-05-07T20:32:42.1857832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1858013Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1858149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1858154Z 2025-05-07T20:32:42.1858358Z self = 2025-05-07T20:32:42.1859134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1859632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8bab420>} 2025-05-07T20:32:42.1860401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1860642Z context = 2025-05-07T20:32:42.1860647Z 2025-05-07T20:32:42.1860811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1861081Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1861185Z module_map=module_map) 2025-05-07T20:32:42.1861352Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1861452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1861526Z E ^ 2025-05-07T20:32:42.1861891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1861896Z 2025-05-07T20:32:42.1862314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1862323Z 2025-05-07T20:32:42.1862434Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1862658Z self=, 2025-05-07T20:32:42.1862736Z T=4096, 2025-05-07T20:32:42.1862831Z D=5120, 2025-05-07T20:32:42.1862916Z scale_ub=1200.0, 2025-05-07T20:32:42.1863001Z contiguous=False, 2025-05-07T20:32:42.1863092Z compiled=True, 2025-05-07T20:32:42.1863167Z ) 2025-05-07T20:32:42.1863383Z self = 2025-05-07T20:32:42.1863573Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.1863577Z 2025-05-07T20:32:42.1863658Z @given( 2025-05-07T20:32:42.1863781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1863890Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1864051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1864231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1864347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1864426Z ) 2025-05-07T20:32:42.1864688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1864787Z def test_silu_mul_quant( 2025-05-07T20:32:42.1864866Z self, 2025-05-07T20:32:42.1864958Z T: int, 2025-05-07T20:32:42.1865038Z D: int, 2025-05-07T20:32:42.1865138Z scale_ub: Optional[float], 2025-05-07T20:32:42.1865240Z contiguous: bool, 2025-05-07T20:32:42.1865328Z compiled: bool, 2025-05-07T20:32:42.1865414Z ) -> None: 2025-05-07T20:32:42.1865508Z torch.manual_seed(2025) 2025-05-07T20:32:42.1865582Z 2025-05-07T20:32:42.1865759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1865877Z 2025-05-07T20:32:42.1865974Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1866108Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1866200Z x = x_sign * x_clamp 2025-05-07T20:32:42.1866280Z x0 = x[:, :D] 2025-05-07T20:32:42.1866370Z x1 = x[:, D:] 2025-05-07T20:32:42.1866442Z 2025-05-07T20:32:42.1866524Z if contiguous: 2025-05-07T20:32:42.1866623Z x0 = x0.contiguous() 2025-05-07T20:32:42.1866712Z x1 = x1.contiguous() 2025-05-07T20:32:42.1866787Z 2025-05-07T20:32:42.1866885Z if scale_ub is not None: 2025-05-07T20:32:42.1866990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1867131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1867209Z ) 2025-05-07T20:32:42.1867681Z else: 2025-05-07T20:32:42.1867783Z scale_ub_tensor = None 2025-05-07T20:32:42.1867859Z 2025-05-07T20:32:42.1867994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1868094Z op = silu_mul_quant 2025-05-07T20:32:42.1868229Z if compiled: 2025-05-07T20:32:42.1868328Z op = torch.compile(op) 2025-05-07T20:32:42.1868449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1868522Z 2025-05-07T20:32:42.1868616Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1868621Z 2025-05-07T20:32:42.1868724Z moe/activation_test.py:117: 2025-05-07T20:32:42.1868851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1868951Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1869052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1869423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1869518Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1870025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1870125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1870496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1870716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1871057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1871156Z kernel = self.compile( 2025-05-07T20:32:42.1871540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1871724Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1871850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1871854Z 2025-05-07T20:32:42.1872060Z self = 2025-05-07T20:32:42.1872911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1873453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fc860>} 2025-05-07T20:32:42.1874209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1874396Z context = 2025-05-07T20:32:42.1874401Z 2025-05-07T20:32:42.1874562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1874880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1874988Z module_map=module_map) 2025-05-07T20:32:42.1875159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1875258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1875331Z E ^ 2025-05-07T20:32:42.1875693Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1875698Z 2025-05-07T20:32:42.1876110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1876114Z 2025-05-07T20:32:42.1876228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1876449Z self=, 2025-05-07T20:32:42.1876525Z T=2048, 2025-05-07T20:32:42.1876612Z D=7168, 2025-05-07T20:32:42.1876697Z scale_ub=1200.0, 2025-05-07T20:32:42.1876782Z contiguous=False, 2025-05-07T20:32:42.1876877Z compiled=False, 2025-05-07T20:32:42.1877002Z ) 2025-05-07T20:32:42.1877217Z self = 2025-05-07T20:32:42.1877403Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1877407Z 2025-05-07T20:32:42.1877485Z @given( 2025-05-07T20:32:42.1877620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1877715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1877829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1877954Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1878063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1878136Z ) 2025-05-07T20:32:42.1878390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1878484Z def test_silu_mul_quant( 2025-05-07T20:32:42.1878567Z self, 2025-05-07T20:32:42.1878656Z T: int, 2025-05-07T20:32:42.1878735Z D: int, 2025-05-07T20:32:42.1878834Z scale_ub: Optional[float], 2025-05-07T20:32:42.1878932Z contiguous: bool, 2025-05-07T20:32:42.1879018Z compiled: bool, 2025-05-07T20:32:42.1879102Z ) -> None: 2025-05-07T20:32:42.1879201Z torch.manual_seed(2025) 2025-05-07T20:32:42.1879274Z 2025-05-07T20:32:42.1879452Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1879529Z 2025-05-07T20:32:42.1879626Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1879772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1879865Z x = x_sign * x_clamp 2025-05-07T20:32:42.1879945Z x0 = x[:, :D] 2025-05-07T20:32:42.1880037Z x1 = x[:, D:] 2025-05-07T20:32:42.1880109Z 2025-05-07T20:32:42.1880196Z if contiguous: 2025-05-07T20:32:42.1880305Z x0 = x0.contiguous() 2025-05-07T20:32:42.1880395Z x1 = x1.contiguous() 2025-05-07T20:32:42.1880539Z 2025-05-07T20:32:42.1880638Z if scale_ub is not None: 2025-05-07T20:32:42.1880789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1880939Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1881019Z ) 2025-05-07T20:32:42.1881095Z else: 2025-05-07T20:32:42.1881201Z scale_ub_tensor = None 2025-05-07T20:32:42.1881277Z 2025-05-07T20:32:42.1881410Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1881513Z op = silu_mul_quant 2025-05-07T20:32:42.1881598Z if compiled: 2025-05-07T20:32:42.1881698Z op = torch.compile(op) 2025-05-07T20:32:42.1881815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1881889Z 2025-05-07T20:32:42.1881993Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1881998Z 2025-05-07T20:32:42.1882133Z moe/activation_test.py:117: 2025-05-07T20:32:42.1882268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1882380Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1882481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1882985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1883090Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1883449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1883828Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1884170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1884264Z kernel = self.compile( 2025-05-07T20:32:42.1884662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1884839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1885011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1885023Z 2025-05-07T20:32:42.1885227Z self = 2025-05-07T20:32:42.1885990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1886495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fd6c0>} 2025-05-07T20:32:42.1887248Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1887447Z context = 2025-05-07T20:32:42.1887454Z 2025-05-07T20:32:42.1887616Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1887876Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1887991Z module_map=module_map) 2025-05-07T20:32:42.1888151Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1888244Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1888330Z E ^ 2025-05-07T20:32:42.1888682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1888687Z 2025-05-07T20:32:42.1889107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1889111Z 2025-05-07T20:32:42.1889262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1889523Z self=, 2025-05-07T20:32:42.1889605Z T=1, 2025-05-07T20:32:42.1889679Z D=7168, 2025-05-07T20:32:42.1889765Z scale_ub=None, 2025-05-07T20:32:42.1889847Z contiguous=True, 2025-05-07T20:32:42.1889927Z compiled=False, 2025-05-07T20:32:42.1890005Z ) 2025-05-07T20:32:42.1890222Z self = 2025-05-07T20:32:42.1890384Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1890389Z 2025-05-07T20:32:42.1890474Z @given( 2025-05-07T20:32:42.1890591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1890687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1890807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1890965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1891088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1891170Z ) 2025-05-07T20:32:42.1891412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1891510Z def test_silu_mul_quant( 2025-05-07T20:32:42.1891585Z self, 2025-05-07T20:32:42.1891659Z T: int, 2025-05-07T20:32:42.1891740Z D: int, 2025-05-07T20:32:42.1891836Z scale_ub: Optional[float], 2025-05-07T20:32:42.1891923Z contiguous: bool, 2025-05-07T20:32:42.1892014Z compiled: bool, 2025-05-07T20:32:42.1892090Z ) -> None: 2025-05-07T20:32:42.1892184Z torch.manual_seed(2025) 2025-05-07T20:32:42.1892261Z 2025-05-07T20:32:42.1892429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1892497Z 2025-05-07T20:32:42.1892593Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1892717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1892815Z x = x_sign * x_clamp 2025-05-07T20:32:42.1892938Z x0 = x[:, :D] 2025-05-07T20:32:42.1893015Z x1 = x[:, D:] 2025-05-07T20:32:42.1893091Z 2025-05-07T20:32:42.1893171Z if contiguous: 2025-05-07T20:32:42.1893265Z x0 = x0.contiguous() 2025-05-07T20:32:42.1893361Z x1 = x1.contiguous() 2025-05-07T20:32:42.1893434Z 2025-05-07T20:32:42.1893524Z if scale_ub is not None: 2025-05-07T20:32:42.1893636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1893770Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1893844Z ) 2025-05-07T20:32:42.1893925Z else: 2025-05-07T20:32:42.1894015Z scale_ub_tensor = None 2025-05-07T20:32:42.1894093Z 2025-05-07T20:32:42.1894221Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1894310Z op = silu_mul_quant 2025-05-07T20:32:42.1894400Z if compiled: 2025-05-07T20:32:42.1894501Z op = torch.compile(op) 2025-05-07T20:32:42.1894608Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1894688Z 2025-05-07T20:32:42.1894778Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1894783Z 2025-05-07T20:32:42.1894878Z moe/activation_test.py:117: 2025-05-07T20:32:42.1895012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1895112Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1895218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1895721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1895816Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1896182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1896406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1896796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1896943Z kernel = self.compile( 2025-05-07T20:32:42.1897326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1897508Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1897636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1897641Z 2025-05-07T20:32:42.1897845Z self = 2025-05-07T20:32:42.1898622Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1899187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fcfe0>} 2025-05-07T20:32:42.1899948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1900135Z context = 2025-05-07T20:32:42.1900140Z 2025-05-07T20:32:42.1900310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1900573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1900679Z module_map=module_map) 2025-05-07T20:32:42.1900845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1900940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1901021Z E ^ 2025-05-07T20:32:42.1901383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1901459Z 2025-05-07T20:32:42.1901875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1901879Z 2025-05-07T20:32:42.1901987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1902207Z self=, 2025-05-07T20:32:42.1902287Z T=16384, 2025-05-07T20:32:42.1902368Z D=7168, 2025-05-07T20:32:42.1902450Z scale_ub=1200.0, 2025-05-07T20:32:42.1902540Z contiguous=False, 2025-05-07T20:32:42.1902627Z compiled=True, 2025-05-07T20:32:42.1902698Z ) 2025-05-07T20:32:42.1902915Z self = 2025-05-07T20:32:42.1903101Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.1903105Z 2025-05-07T20:32:42.1903185Z @given( 2025-05-07T20:32:42.1903312Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1903411Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1903524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1903645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1903757Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1903829Z ) 2025-05-07T20:32:42.1904079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1904170Z def test_silu_mul_quant( 2025-05-07T20:32:42.1904245Z self, 2025-05-07T20:32:42.1904330Z T: int, 2025-05-07T20:32:42.1904405Z D: int, 2025-05-07T20:32:42.1904504Z scale_ub: Optional[float], 2025-05-07T20:32:42.1904592Z contiguous: bool, 2025-05-07T20:32:42.1904675Z compiled: bool, 2025-05-07T20:32:42.1904761Z ) -> None: 2025-05-07T20:32:42.1904852Z torch.manual_seed(2025) 2025-05-07T20:32:42.1904976Z 2025-05-07T20:32:42.1905216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1905292Z 2025-05-07T20:32:42.1905385Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1905527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1905616Z x = x_sign * x_clamp 2025-05-07T20:32:42.1905696Z x0 = x[:, :D] 2025-05-07T20:32:42.1905783Z x1 = x[:, D:] 2025-05-07T20:32:42.1905853Z 2025-05-07T20:32:42.1905933Z if contiguous: 2025-05-07T20:32:42.1906035Z x0 = x0.contiguous() 2025-05-07T20:32:42.1906121Z x1 = x1.contiguous() 2025-05-07T20:32:42.1906192Z 2025-05-07T20:32:42.1906289Z if scale_ub is not None: 2025-05-07T20:32:42.1906391Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1906528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1906657Z ) 2025-05-07T20:32:42.1906733Z else: 2025-05-07T20:32:42.1906841Z scale_ub_tensor = None 2025-05-07T20:32:42.1906915Z 2025-05-07T20:32:42.1907044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1907136Z op = silu_mul_quant 2025-05-07T20:32:42.1907220Z if compiled: 2025-05-07T20:32:42.1907318Z op = torch.compile(op) 2025-05-07T20:32:42.1907431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1907501Z 2025-05-07T20:32:42.1907588Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1907592Z 2025-05-07T20:32:42.1907692Z moe/activation_test.py:117: 2025-05-07T20:32:42.1907821Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1907925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1908025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1908400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1908502Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1909051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1909147Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1909517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1909739Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1910088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1910181Z kernel = self.compile( 2025-05-07T20:32:42.1910565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1910749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1910878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1910888Z 2025-05-07T20:32:42.1911099Z self = 2025-05-07T20:32:42.1911865Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1912360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875ffb00>} 2025-05-07T20:32:42.1913119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1913310Z context = 2025-05-07T20:32:42.1913355Z 2025-05-07T20:32:42.1913527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1913840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1913943Z module_map=module_map) 2025-05-07T20:32:42.1914111Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1914206Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1914283Z E ^ 2025-05-07T20:32:42.1914644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1914649Z 2025-05-07T20:32:42.1915060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1915065Z 2025-05-07T20:32:42.1915169Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1915431Z self=, 2025-05-07T20:32:42.1915511Z T=1, 2025-05-07T20:32:42.1915596Z D=7168, 2025-05-07T20:32:42.1915677Z scale_ub=None, 2025-05-07T20:32:42.1915758Z contiguous=False, 2025-05-07T20:32:42.1915846Z compiled=False, 2025-05-07T20:32:42.1915918Z ) 2025-05-07T20:32:42.1916139Z self = 2025-05-07T20:32:42.1916304Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.1916308Z 2025-05-07T20:32:42.1916386Z @given( 2025-05-07T20:32:42.1916511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1916608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1916723Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1916844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1916955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1917036Z ) 2025-05-07T20:32:42.1917280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1917420Z def test_silu_mul_quant( 2025-05-07T20:32:42.1917502Z self, 2025-05-07T20:32:42.1917579Z T: int, 2025-05-07T20:32:42.1917652Z D: int, 2025-05-07T20:32:42.1917754Z scale_ub: Optional[float], 2025-05-07T20:32:42.1917842Z contiguous: bool, 2025-05-07T20:32:42.1917924Z compiled: bool, 2025-05-07T20:32:42.1918008Z ) -> None: 2025-05-07T20:32:42.1918103Z torch.manual_seed(2025) 2025-05-07T20:32:42.1918175Z 2025-05-07T20:32:42.1918352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1918424Z 2025-05-07T20:32:42.1918512Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1918642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1918730Z x = x_sign * x_clamp 2025-05-07T20:32:42.1918820Z x0 = x[:, :D] 2025-05-07T20:32:42.1918897Z x1 = x[:, D:] 2025-05-07T20:32:42.1918970Z 2025-05-07T20:32:42.1919061Z if contiguous: 2025-05-07T20:32:42.1919151Z x0 = x0.contiguous() 2025-05-07T20:32:42.1919237Z x1 = x1.contiguous() 2025-05-07T20:32:42.1919319Z 2025-05-07T20:32:42.1919408Z if scale_ub is not None: 2025-05-07T20:32:42.1919510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1919650Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1919725Z ) 2025-05-07T20:32:42.1919800Z else: 2025-05-07T20:32:42.1919901Z scale_ub_tensor = None 2025-05-07T20:32:42.1919969Z 2025-05-07T20:32:42.1920107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1920196Z op = silu_mul_quant 2025-05-07T20:32:42.1920276Z if compiled: 2025-05-07T20:32:42.1920381Z op = torch.compile(op) 2025-05-07T20:32:42.1920488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1920557Z 2025-05-07T20:32:42.1920701Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1920745Z 2025-05-07T20:32:42.1920842Z moe/activation_test.py:117: 2025-05-07T20:32:42.1920968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1921078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1921178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1921687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1921785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1922147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1922376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1922759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1922860Z kernel = self.compile( 2025-05-07T20:32:42.1923260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1923434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1923694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1923700Z 2025-05-07T20:32:42.1923905Z self = 2025-05-07T20:32:42.1924671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1925180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a83749a0>} 2025-05-07T20:32:42.1925934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1926177Z context = 2025-05-07T20:32:42.1926181Z 2025-05-07T20:32:42.1926345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1926612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1926717Z module_map=module_map) 2025-05-07T20:32:42.1926881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1926982Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1927056Z E ^ 2025-05-07T20:32:42.1927412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1927417Z 2025-05-07T20:32:42.1927839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1927849Z 2025-05-07T20:32:42.1927951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1928177Z self=, 2025-05-07T20:32:42.1928254Z T=2048, 2025-05-07T20:32:42.1928329Z D=7168, 2025-05-07T20:32:42.1928413Z scale_ub=None, 2025-05-07T20:32:42.1928498Z contiguous=False, 2025-05-07T20:32:42.1928576Z compiled=True, 2025-05-07T20:32:42.1928653Z ) 2025-05-07T20:32:42.1928867Z self = 2025-05-07T20:32:42.1929037Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1929048Z 2025-05-07T20:32:42.1929122Z @given( 2025-05-07T20:32:42.1929241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1929389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1929509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1929671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1929790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1929863Z ) 2025-05-07T20:32:42.1930105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1930201Z def test_silu_mul_quant( 2025-05-07T20:32:42.1930275Z self, 2025-05-07T20:32:42.1930349Z T: int, 2025-05-07T20:32:42.1930428Z D: int, 2025-05-07T20:32:42.1930523Z scale_ub: Optional[float], 2025-05-07T20:32:42.1930615Z contiguous: bool, 2025-05-07T20:32:42.1930698Z compiled: bool, 2025-05-07T20:32:42.1930770Z ) -> None: 2025-05-07T20:32:42.1930871Z torch.manual_seed(2025) 2025-05-07T20:32:42.1930943Z 2025-05-07T20:32:42.1931180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1931266Z 2025-05-07T20:32:42.1931355Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1931481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1931575Z x = x_sign * x_clamp 2025-05-07T20:32:42.1931656Z x0 = x[:, :D] 2025-05-07T20:32:42.1931733Z x1 = x[:, D:] 2025-05-07T20:32:42.1931807Z 2025-05-07T20:32:42.1931890Z if contiguous: 2025-05-07T20:32:42.1931988Z x0 = x0.contiguous() 2025-05-07T20:32:42.1932073Z x1 = x1.contiguous() 2025-05-07T20:32:42.1932142Z 2025-05-07T20:32:42.1932234Z if scale_ub is not None: 2025-05-07T20:32:42.1932337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1932469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1932548Z ) 2025-05-07T20:32:42.1932623Z else: 2025-05-07T20:32:42.1932712Z scale_ub_tensor = None 2025-05-07T20:32:42.1932793Z 2025-05-07T20:32:42.1932922Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1933057Z op = silu_mul_quant 2025-05-07T20:32:42.1933148Z if compiled: 2025-05-07T20:32:42.1933245Z op = torch.compile(op) 2025-05-07T20:32:42.1933346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1933425Z 2025-05-07T20:32:42.1933512Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1933518Z 2025-05-07T20:32:42.1933616Z moe/activation_test.py:117: 2025-05-07T20:32:42.1933743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1933841Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1933946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1934314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1934404Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1934906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1935009Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1935379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1935606Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1935949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1936050Z kernel = self.compile( 2025-05-07T20:32:42.1936434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1936613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1936739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1936746Z 2025-05-07T20:32:42.1936994Z self = 2025-05-07T20:32:42.1937814Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1938312Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8375d00>} 2025-05-07T20:32:42.1939441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1939633Z context = 2025-05-07T20:32:42.1939638Z 2025-05-07T20:32:42.1939944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1940219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1940331Z module_map=module_map) 2025-05-07T20:32:42.1940498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1940595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1940673Z E ^ 2025-05-07T20:32:42.1941032Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1941037Z 2025-05-07T20:32:42.1941451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1941455Z 2025-05-07T20:32:42.1941562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1941784Z self=, 2025-05-07T20:32:42.1941856Z T=4096, 2025-05-07T20:32:42.1941941Z D=7168, 2025-05-07T20:32:42.1942025Z scale_ub=None, 2025-05-07T20:32:42.1942186Z contiguous=False, 2025-05-07T20:32:42.1942279Z compiled=True, 2025-05-07T20:32:42.1942351Z ) 2025-05-07T20:32:42.1942568Z self = 2025-05-07T20:32:42.1942748Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1942753Z 2025-05-07T20:32:42.1942832Z @given( 2025-05-07T20:32:42.1942960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1943055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1943167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1943288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1943399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1943471Z ) 2025-05-07T20:32:42.1943727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1943819Z def test_silu_mul_quant( 2025-05-07T20:32:42.1943900Z self, 2025-05-07T20:32:42.1943984Z T: int, 2025-05-07T20:32:42.1944059Z D: int, 2025-05-07T20:32:42.1944154Z scale_ub: Optional[float], 2025-05-07T20:32:42.1944245Z contiguous: bool, 2025-05-07T20:32:42.1944328Z compiled: bool, 2025-05-07T20:32:42.1944410Z ) -> None: 2025-05-07T20:32:42.1944503Z torch.manual_seed(2025) 2025-05-07T20:32:42.1944577Z 2025-05-07T20:32:42.1944753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1944827Z 2025-05-07T20:32:42.1944914Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1945043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1945132Z x = x_sign * x_clamp 2025-05-07T20:32:42.1945211Z x0 = x[:, :D] 2025-05-07T20:32:42.1945295Z x1 = x[:, D:] 2025-05-07T20:32:42.1945369Z 2025-05-07T20:32:42.1945455Z if contiguous: 2025-05-07T20:32:42.1945552Z x0 = x0.contiguous() 2025-05-07T20:32:42.1945722Z x1 = x1.contiguous() 2025-05-07T20:32:42.1945858Z 2025-05-07T20:32:42.1945953Z if scale_ub is not None: 2025-05-07T20:32:42.1946054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1946195Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1946271Z ) 2025-05-07T20:32:42.1946345Z else: 2025-05-07T20:32:42.1946442Z scale_ub_tensor = None 2025-05-07T20:32:42.1946512Z 2025-05-07T20:32:42.1946639Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1946733Z op = silu_mul_quant 2025-05-07T20:32:42.1946815Z if compiled: 2025-05-07T20:32:42.1946910Z op = torch.compile(op) 2025-05-07T20:32:42.1947020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1947092Z 2025-05-07T20:32:42.1947181Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1947237Z 2025-05-07T20:32:42.1947337Z moe/activation_test.py:117: 2025-05-07T20:32:42.1947466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1947574Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1947671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1948042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1948138Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1948633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1948735Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1949096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1949320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1949670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1949811Z kernel = self.compile( 2025-05-07T20:32:42.1950194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1950375Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1950501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1950505Z 2025-05-07T20:32:42.1950721Z self = 2025-05-07T20:32:42.1951490Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1951992Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8376840>} 2025-05-07T20:32:42.1952757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1952945Z context = 2025-05-07T20:32:42.1952949Z 2025-05-07T20:32:42.1953121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1953385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1953488Z module_map=module_map) 2025-05-07T20:32:42.1953656Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1953753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1953835Z E ^ 2025-05-07T20:32:42.1954236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1954282Z 2025-05-07T20:32:42.1954697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1954701Z 2025-05-07T20:32:42.1954812Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1955031Z self=, 2025-05-07T20:32:42.1955116Z T=16384, 2025-05-07T20:32:42.1955190Z D=5120, 2025-05-07T20:32:42.1955268Z scale_ub=1200.0, 2025-05-07T20:32:42.1955360Z contiguous=False, 2025-05-07T20:32:42.1955446Z compiled=False, 2025-05-07T20:32:42.1955516Z ) 2025-05-07T20:32:42.1955737Z self = 2025-05-07T20:32:42.1955916Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1955920Z 2025-05-07T20:32:42.1956037Z @given( 2025-05-07T20:32:42.1956167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1956270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1956391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1956505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1956617Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1956699Z ) 2025-05-07T20:32:42.1956946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1957036Z def test_silu_mul_quant( 2025-05-07T20:32:42.1957124Z self, 2025-05-07T20:32:42.1957200Z T: int, 2025-05-07T20:32:42.1957273Z D: int, 2025-05-07T20:32:42.1957379Z scale_ub: Optional[float], 2025-05-07T20:32:42.1957467Z contiguous: bool, 2025-05-07T20:32:42.1957550Z compiled: bool, 2025-05-07T20:32:42.1957637Z ) -> None: 2025-05-07T20:32:42.1957733Z torch.manual_seed(2025) 2025-05-07T20:32:42.1957814Z 2025-05-07T20:32:42.1957986Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1958106Z 2025-05-07T20:32:42.1958201Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1958326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1958412Z x = x_sign * x_clamp 2025-05-07T20:32:42.1958494Z x0 = x[:, :D] 2025-05-07T20:32:42.1958574Z x1 = x[:, D:] 2025-05-07T20:32:42.1958646Z 2025-05-07T20:32:42.1958740Z if contiguous: 2025-05-07T20:32:42.1958828Z x0 = x0.contiguous() 2025-05-07T20:32:42.1958915Z x1 = x1.contiguous() 2025-05-07T20:32:42.1958991Z 2025-05-07T20:32:42.1959078Z if scale_ub is not None: 2025-05-07T20:32:42.1959184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1959325Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1959402Z ) 2025-05-07T20:32:42.1959493Z else: 2025-05-07T20:32:42.1959588Z scale_ub_tensor = None 2025-05-07T20:32:42.1959665Z 2025-05-07T20:32:42.1959803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1959894Z op = silu_mul_quant 2025-05-07T20:32:42.1959976Z if compiled: 2025-05-07T20:32:42.1960083Z op = torch.compile(op) 2025-05-07T20:32:42.1960192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1960269Z 2025-05-07T20:32:42.1960379Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1960383Z 2025-05-07T20:32:42.1960484Z moe/activation_test.py:117: 2025-05-07T20:32:42.1960624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1960725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1960829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1961351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1961497Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1961867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1962181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1962522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1962624Z kernel = self.compile( 2025-05-07T20:32:42.1963008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1963181Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1963313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1963318Z 2025-05-07T20:32:42.1963522Z self = 2025-05-07T20:32:42.1964465Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1964970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f14040>} 2025-05-07T20:32:42.1965717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1965917Z context = 2025-05-07T20:32:42.1965922Z 2025-05-07T20:32:42.1966087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1966356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1966463Z module_map=module_map) 2025-05-07T20:32:42.1966669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1966769Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1966844Z E ^ 2025-05-07T20:32:42.1972385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1972395Z 2025-05-07T20:32:42.1972838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1972843Z 2025-05-07T20:32:42.1972957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1973182Z self=, 2025-05-07T20:32:42.1973270Z T=16384, 2025-05-07T20:32:42.1973350Z D=5120, 2025-05-07T20:32:42.1973434Z scale_ub=1200.0, 2025-05-07T20:32:42.1973535Z contiguous=True, 2025-05-07T20:32:42.1973620Z compiled=True, 2025-05-07T20:32:42.1973694Z ) 2025-05-07T20:32:42.1973928Z self = 2025-05-07T20:32:42.1974111Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.1974116Z 2025-05-07T20:32:42.1974193Z @given( 2025-05-07T20:32:42.1974323Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1974424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1974550Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1974667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1974781Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1974865Z ) 2025-05-07T20:32:42.1975115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1975210Z def test_silu_mul_quant( 2025-05-07T20:32:42.1975297Z self, 2025-05-07T20:32:42.1975377Z T: int, 2025-05-07T20:32:42.1975455Z D: int, 2025-05-07T20:32:42.1975651Z scale_ub: Optional[float], 2025-05-07T20:32:42.1975783Z contiguous: bool, 2025-05-07T20:32:42.1975868Z compiled: bool, 2025-05-07T20:32:42.1975956Z ) -> None: 2025-05-07T20:32:42.1976052Z torch.manual_seed(2025) 2025-05-07T20:32:42.1976125Z 2025-05-07T20:32:42.1976305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1976381Z 2025-05-07T20:32:42.1976484Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1976612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1976706Z x = x_sign * x_clamp 2025-05-07T20:32:42.1976799Z x0 = x[:, :D] 2025-05-07T20:32:42.1976881Z x1 = x[:, D:] 2025-05-07T20:32:42.1976953Z 2025-05-07T20:32:42.1977046Z if contiguous: 2025-05-07T20:32:42.1977139Z x0 = x0.contiguous() 2025-05-07T20:32:42.1977276Z x1 = x1.contiguous() 2025-05-07T20:32:42.1977358Z 2025-05-07T20:32:42.1977452Z if scale_ub is not None: 2025-05-07T20:32:42.1977562Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1977703Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1977778Z ) 2025-05-07T20:32:42.1977862Z else: 2025-05-07T20:32:42.1977957Z scale_ub_tensor = None 2025-05-07T20:32:42.1978031Z 2025-05-07T20:32:42.1978167Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1978259Z op = silu_mul_quant 2025-05-07T20:32:42.1978347Z if compiled: 2025-05-07T20:32:42.1978456Z op = torch.compile(op) 2025-05-07T20:32:42.1978562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1978636Z 2025-05-07T20:32:42.1978735Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1978739Z 2025-05-07T20:32:42.1978839Z moe/activation_test.py:117: 2025-05-07T20:32:42.1978979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1979137Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1979241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1979624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1979719Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1980216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1980322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1980681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1980914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1981259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1981359Z kernel = self.compile( 2025-05-07T20:32:42.1981755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1981936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1982065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1982076Z 2025-05-07T20:32:42.1982284Z self = 2025-05-07T20:32:42.1983058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1983564Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f15300>} 2025-05-07T20:32:42.1984362Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1984605Z context = 2025-05-07T20:32:42.1984610Z 2025-05-07T20:32:42.1984774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1985038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1985153Z module_map=module_map) 2025-05-07T20:32:42.1985314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1985408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1985493Z E ^ 2025-05-07T20:32:42.1985848Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis went on to generate further examples, and every one failed identically: the same test source, the same traceback through torch/_dynamo/eval_frame.py (when compiled=True), fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, triton/runtime/jit.py, and triton/compiler/compiler.py:273, ending in the same CompilationError from compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The parameter combinations tried were:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2137706Z 2025-05-07T20:32:42.2138122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2138126Z 2025-05-07T20:32:42.2138228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2138872Z self=, 2025-05-07T20:32:42.2138952Z T=16384, 2025-05-07T20:32:42.2139023Z D=5120, 2025-05-07T20:32:42.2139108Z scale_ub=None, 2025-05-07T20:32:42.2139337Z contiguous=False, 2025-05-07T20:32:42.2139433Z compiled=False, 2025-05-07T20:32:42.2139512Z ) 2025-05-07T20:32:42.2139733Z self = 2025-05-07T20:32:42.2139915Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.2139919Z 2025-05-07T20:32:42.2139995Z @given( 2025-05-07T20:32:42.2140115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2140219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2140332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2140447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2140567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2140641Z ) 2025-05-07T20:32:42.2140893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2140986Z def test_silu_mul_quant( 2025-05-07T20:32:42.2141064Z self, 2025-05-07T20:32:42.2141143Z T: int, 2025-05-07T20:32:42.2141222Z D: int, 2025-05-07T20:32:42.2141387Z scale_ub: Optional[float], 2025-05-07T20:32:42.2141483Z contiguous: bool, 2025-05-07T20:32:42.2141566Z compiled: bool, 2025-05-07T20:32:42.2141642Z ) -> None: 2025-05-07T20:32:42.2141740Z torch.manual_seed(2025) 2025-05-07T20:32:42.2141813Z 2025-05-07T20:32:42.2141980Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2142059Z 2025-05-07T20:32:42.2142147Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2142275Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2144092Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
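Note on the CompilationError: Triton's fp8e4nv is the float8_e4m3fn layout this kernel quantizes to, and Triton only accepts it on GPUs of compute capability 8.9 or newer (Ada/Hopper); on older Ampere-class parts such as sm_86 it offers only fp8e4b15 and fp8e5, which is exactly what the error reports. A minimal capability guard could skip these examples cleanly on such hardware; this is a sketch, not the FBGEMM suite's actual skip logic, and the helper name is an assumption:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9;
        # an sm_86 device reports (8, 6) and takes the error path seen above.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    # Hypothetical usage on a test like test_silu_mul_quant:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")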
2025-05-07T20:32:42.2138228Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.2138872Z     self=<...>,
2025-05-07T20:32:42.2138952Z     T=16384,
2025-05-07T20:32:42.2139023Z     D=5120,
2025-05-07T20:32:42.2139108Z     scale_ub=None,
2025-05-07T20:32:42.2139337Z     contiguous=False,
2025-05-07T20:32:42.2139433Z     compiled=False,
2025-05-07T20:32:42.2139512Z )
[... test body as above ...]
2025-05-07T20:32:42.2142147Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.2142275Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.2144092Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.2144227Z moe/activation_test.py:95: OutOfMemoryError

Hypothesis then tries the remaining examples; each fails with one of the same two errors. Summary, one line per example (failing statements refer to moe/activation_test.py; the allocation arithmetic is checked after the table):

T      D     scale_ub  contiguous  compiled  fails at           error
4096   7168  1200.0    True        True      :95 (x_clamp)      OutOfMemoryError: tried 112.00 MiB, 28.44 MiB free, 21.61 GiB PyTorch-allocated
16384  7168  None      False       False     :92 (torch.randn)  OutOfMemoryError: tried 448.00 MiB, 140.44 MiB free, 21.50 GiB PyTorch-allocated
2048   7168  1200.0    True        True      :95 (x_clamp)      OutOfMemoryError: tried 56.00 MiB, 28.44 MiB free, 21.67 GiB PyTorch-allocated
2048   7168  None      True        False     :94 (x_sign)       OutOfMemoryError: tried 56.00 MiB, 28.44 MiB free, 21.67 GiB PyTorch-allocated
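The requested sizes above are exactly the footprint of one [T, 2 * D] bfloat16 tensor (2 bytes per element), so each example dies on its first or second allocation; the real problem is the 21.5+ GiB already held from earlier examples. A quick check of that arithmetic (plain Python, nothing FBGEMM-specific):

    # bfloat16 is 2 bytes per element; x has shape [T, 2 * D].
    def bf16_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / (1024 * 1024)

    assert bf16_mib(16384, 7168) == 448.0  # matches "Tried to allocate 448.00 MiB"
    assert bf16_mib(16384, 5120) == 320.0  # 320.00 MiB
    assert bf16_mib(4096, 7168) == 112.0   # 112.00 MiB
    assert bf16_mib(2048, 7168) == 56.0    # 56.00 MiB (torch.sign / torch.clamp request the same size)
    assert bf16_mib(2048, 5120) == 40.0    # 40.00 MiB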
Summary, continued:

T      D     scale_ub  contiguous  compiled  fails at           error
1      7168  1200.0    True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
128    5120  None      True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
128    7168  None      True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
2048   7168  1200.0    True        False     :92 (torch.randn)  OutOfMemoryError: tried 56.00 MiB, 26.44 MiB free, 21.69 GiB PyTorch-allocated
1      5120  1200.0    True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
2048   5120  None      True        False     :94 (x_sign)       OutOfMemoryError: tried 40.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
16384  5120  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 320.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   5120  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 80.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
2048   5120  None      False       False     :92 (torch.randn)  OutOfMemoryError: tried 40.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   7168  None      True        True      :92 (torch.randn)  OutOfMemoryError: tried 112.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
2048   5120  1200.0    False       False     :92 (torch.randn)  OutOfMemoryError: tried 40.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   7168  1200.0    True        False     :92 (torch.randn)  OutOfMemoryError: tried 112.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
16384  7168  None      False       True      :92 (torch.randn)  OutOfMemoryError: tried 448.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   7168  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 112.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
16384  7168  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 448.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2281672Z 2025-05-07T20:32:42.2281794Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2281798Z 2025-05-07T20:32:42.2281905Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2282135Z self=, 2025-05-07T20:32:42.2282213Z T=16384, 2025-05-07T20:32:42.2282292Z D=7168, 2025-05-07T20:32:42.2282383Z scale_ub=1200.0, 2025-05-07T20:32:42.2282466Z contiguous=True, 2025-05-07T20:32:42.2282552Z compiled=False, 2025-05-07T20:32:42.2282634Z ) 2025-05-07T20:32:42.2282848Z self = 2025-05-07T20:32:42.2283023Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.2283029Z 2025-05-07T20:32:42.2283112Z @given( 2025-05-07T20:32:42.2283274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2283409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2283752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2283870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2283990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2284066Z ) 2025-05-07T20:32:42.2284308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2284408Z def test_silu_mul_quant( 2025-05-07T20:32:42.2284484Z self, 2025-05-07T20:32:42.2284564Z T: int, 2025-05-07T20:32:42.2284648Z D: int, 2025-05-07T20:32:42.2284743Z scale_ub: Optional[float], 2025-05-07T20:32:42.2284831Z contiguous: bool, 2025-05-07T20:32:42.2284923Z compiled: bool, 2025-05-07T20:32:42.2284998Z ) -> None: 2025-05-07T20:32:42.2285134Z torch.manual_seed(2025) 2025-05-07T20:32:42.2285213Z 2025-05-07T20:32:42.2285391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2287164Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
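The requested sizes match the first allocation in the test exactly: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick arithmetic check (the helper name is ours, not the suite's):

    def randn_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): T * 2D elements,
        # 2 bytes each; report the size in MiB.
        return T * 2 * D * 2 / 2**20

    print(randn_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
    print(randn_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
    print(randn_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"

With only 26.44 MiB free out of 22.07 GiB, even the smallest of these requests must fail; and since 21.73 GiB is already held by PyTorch, the failures point at memory accumulated across earlier Hypothesis examples rather than any single oversized tensor.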
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb286dd2700>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
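This CompilationError is the second failure mode in the run. Triton's fp8e4nv corresponds to float8_e4m3fn, which Triton only lowers on devices of compute capability 8.9 or newer (Ada/Hopper); the A10G in a g5.4xlarge reports (8, 6), which is why only fp8e4b15 and fp8e5 appear in the ValueError. A capability guard such a test could use, sketched here (the helper name and skip message are ours, not the suite's):

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) lowering requires compute capability >= 8.9;
        # the A10G on this runner reports (8, 6), hence the error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Inside a test body:
    # if not supports_fp8e4nv():
    #     pytest.skip("fp8e4nv is not supported on this GPU architecture")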
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
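Every OOM message above ends with the same allocator hint. Two mitigations are worth distinguishing, sketched here under the assumption that the real fix would live in the suite's setup rather than in this log: the environment variable only helps with fragmentation, while releasing cached blocks between Hypothesis examples addresses the accumulation these failures suggest.

    import gc
    import os

    # Must be set before the process makes its first CUDA allocation, e.g. in
    # the CI job environment rather than inside a running test.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached allocator blocks
        # so the next Hypothesis example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()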
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
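Note that the compiled variant fails the same way as the eager one: torch.compile only adds the _dynamo/eval_frame.py frame, while the Triton kernel underneath still targets the same GPU. A minimal repro sketch assembled from the names in this traceback (shapes taken from the failing example; illustrative only, not the suite's code):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)

    op = torch.compile(silu_mul_quant)
    # On an sm_86 device this raises the same CompilationError: fp8e4nv ...
    y_fp8, y_scale = op(x0, x1, None)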
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
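A side note on the line-95 failures: torch.clamp(torch.abs(x), 0.01, 2.0) materializes a full extra [T, 2 * D] temporary on a GPU that is already at capacity. Where the extra copy matters and the input need not be preserved, in-place variants avoid most of it; a sketch with equivalent math (our helper, not the suite's):

    import torch

    def sign_clamp_(x: torch.Tensor) -> torch.Tensor:
        # Same result as sign(x) * clamp(abs(x), 0.01, 2.0), but abs/clamp/mul
        # reuse x's storage; only x_sign remains as a full-size temporary.
        x_sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(x_sign)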
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
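The three deprecation warnings point at Triton PR #4496: the warmup, rep, and use_cuda_graph arguments to triton.autotune are deprecated. A hypothetical decorator that would emit exactly this warning (the kernel and config values are invented for illustration):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK_SIZE": 128}, num_warps=4)],
        key=["N"],
        warmup=25,             # deprecated -> DeprecationWarning at autotuner.py:108
        rep=100,               # deprecated
        use_cuda_graph=False,  # deprecated
    )
    @triton.jit
    def _dummy_kernel(x_ptr, N, BLOCK_SIZE: tl.constexpr):
        pass

Dropping the three arguments and letting the autotuner use its own benchmarking defaults is the straightforward way to silence the warning.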
See " 2025-05-07T20:32:42.2337214Z 2025-05-07T20:32:42.2337433Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:42.2337599Z ================= 1 failed, 1 deselected, 3 warnings in 14.28s ================= 2025-05-07T20:32:43.9091040Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:43.9738341Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:43.9738956Z 2025-05-07T20:32:43.9739759Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:43.9740378Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:43.9740791Z 2025-05-07T20:32:43.9740797Z 2025-05-07T20:32:43.9740829Z 2025-05-07T20:32:43.9757897Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:43.9846514Z Post job cleanup. 2025-05-07T20:32:44.0832499Z [command]/usr/bin/git version 2025-05-07T20:32:44.0877581Z git version 2.47.1 2025-05-07T20:32:44.0912508Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/c05eb607-b4a7-4c62-baaf-1e5e09a282de/.gitconfig' 2025-05-07T20:32:44.0922606Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c05eb607-b4a7-4c62-baaf-1e5e09a282de' before making global git config changes 2025-05-07T20:32:44.0923477Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:44.0928214Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:44.0971621Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:44.1006073Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:44.1339901Z Entering 'external/asmjit' 2025-05-07T20:32:44.1406336Z Entering 'external/composable_kernel' 2025-05-07T20:32:44.1478514Z Entering 'external/cpuinfo' 2025-05-07T20:32:44.1552785Z Entering 'external/cutlass' 2025-05-07T20:32:44.1627210Z Entering 'external/googletest' 2025-05-07T20:32:44.1694918Z Entering 'external/hipify_torch' 2025-05-07T20:32:44.1761810Z Entering 'external/json' 2025-05-07T20:32:44.1851733Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:44.1878784Z http.https://github.com/.extraheader 2025-05-07T20:32:44.1892231Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:44.1923122Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:44.2253595Z Entering 'external/asmjit' 2025-05-07T20:32:44.2296004Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2340588Z Entering 'external/composable_kernel' 2025-05-07T20:32:44.2384186Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2433920Z Entering 'external/cpuinfo' 2025-05-07T20:32:44.2476735Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2520429Z Entering 'external/cutlass' 2025-05-07T20:32:44.2565865Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2616669Z 
2025-05-07T20:32:44.2616669Z Entering 'external/googletest'
2025-05-07T20:32:44.2661412Z http.https://github.com/.extraheader
2025-05-07T20:32:44.2704133Z Entering 'external/hipify_torch'
2025-05-07T20:32:44.2748213Z http.https://github.com/.extraheader
2025-05-07T20:32:44.2789949Z Entering 'external/json'
2025-05-07T20:32:44.2837306Z http.https://github.com/.extraheader
2025-05-07T20:32:44.2998426Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:32:44.3030343Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:32:44.3041450Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:32:44.3041809Z ##[endgroup]
2025-05-07T20:32:44.3143440Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:32:55.2834042Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:11.8890837Z Cleaning up orphan processes
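Returning to the [EXEC] lines above: the harness re-runs pytest with --lf --last-failed-no-failures none, so each retry executes only the previously failed tests, and it aborts once the retry budget ("2 + 1 attempts") is exhausted. A rough Python sketch of that retry shape (the real harness is a shell script; the function and its flag handling are ours):

    import subprocess

    CMD = [
        "conda", "run", "python", "-m", "pytest", "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "--lf", "--last-failed-no-failures", "none",
        "./moe/activation_test.py",
    ]

    def run_with_retries(retries: int = 2) -> int:
        # Initial run plus `retries` re-runs of only the last-failed tests.
        rc = 1
        for attempt in range(1, retries + 1):
            rc = subprocess.run(CMD).returncode
            if rc == 0:
                return 0
            print(f"[EXEC] [ATTEMPT {attempt}/{retries}] Command attempt failed.")
        return rc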